DVCFileSystem
DVCFileSystem provides a Pythonic file interface (fsspec-compatible) for a DVC repo. It is a read-only filesystem, so it does not support write operations such as put_file, cp, rm, mv, mkdir, etc.
class DVCFileSystem(AbstractFileSystem):
    def __init__(
        self,
        url: Optional[str] = None,
        rev: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None,
        **kwargs,
    ):
DVCFileSystem provides a unified view of all the files and directories in your repository, whether they are Git-tracked, DVC-tracked, or untracked (in the case of a local repository). It can reuse files in the DVC cache and can otherwise stream from supported remote storage.
>>> from dvc.api import DVCFileSystem
# opening a local repository
>>> fs = DVCFileSystem("/path/to/local/repository")
# opening a remote repository
>>> url = "https://github.com/iterative/example-get-started.git"
>>> fs = DVCFileSystem(url, rev="main")
Parameters
- url - optional URL or local path to the DVC project. If unspecified, the DVC project in the current working directory is used. (Equivalent to the repo argument in dvc.api.open() and dvc.api.read().)
- rev - optional Git commit (any revision such as a branch or a tag name, a commit hash, or an experiment name).
- config - optional config dictionary to pass through to the DVC project.
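As a rough sketch of how these parameters combine, the helper below collects only the arguments that were actually supplied, so that anything omitted falls back to the constructor defaults. The helper name dvc_fs_kwargs, the remote name myremote, and the credential key are illustrative assumptions, and the nested config layout assumes the dictionary mirrors the sections of a DVC config file.

```python
from typing import Any, Dict, Optional


def dvc_fs_kwargs(
    url: Optional[str] = None,
    rev: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return keyword arguments for DVCFileSystem, skipping unset values.

    Hypothetical helper, not part of the DVC API.
    """
    candidates = {"url": url, "rev": rev, "config": config}
    return {name: value for name, value in candidates.items() if value is not None}


kwargs = dvc_fs_kwargs(
    url="https://github.com/iterative/example-get-started.git",
    rev="main",
    # Assumed layout: section -> name -> option, as in a DVC config file.
    config={"remote": {"myremote": {"access_key_id": "<key>"}}},
)

# With DVC installed, the filesystem would then be created as:
#   from dvc.api import DVCFileSystem
#   fs = DVCFileSystem(**kwargs)
```

Calling dvc_fs_kwargs() with no arguments yields an empty dictionary, which corresponds to opening the DVC project in the current working directory.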
Opening a file
>>> import pickle
>>> with fs.open("model.pkl") as f:
...     model = pickle.load(f)
This is similar to dvc.api.open(), which returns a file-like object. Note that, unlike dvc.api.open(), the mode defaults to binary mode, i.e. "rb". You can also specify the encoding argument when opening in text mode ("r").
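Because the default mode is "rb", text mode has to be requested explicitly. The following is a minimal sketch of that, assuming fs is a DVCFileSystem instance (or any object with a compatible open() method); the helper name read_head and its parameters are hypothetical.

```python
def read_head(fs, path, n_lines=5, encoding="utf-8"):
    """Return up to n_lines lines from the start of a text file.

    Text mode ("r") is passed explicitly, since fs.open() would
    otherwise return a binary file object.
    """
    lines = []
    with fs.open(path, mode="r", encoding=encoding) as f:
        for i, line in enumerate(f):
            if i >= n_lines:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

For example, read_head(fs, "README.md", n_lines=3) would return the first three lines of the README as strings, without reading the rest of the file.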
Reading a file
>>> text = fs.read_text("get-started/data.xml", encoding="utf-8")
This is similar to dvc.api.read(), which returns the contents of the file as a string.
To get the binary contents of the file, you can use read_bytes() or cat_file().
>>> contents = fs.read_bytes("get-started/data.xml")
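One case where the binary contents matter is computing a checksum. The sketch below assumes fs is a DVCFileSystem instance; the helper name sha256_of is hypothetical, and the whole file is read into memory, so this is only suitable for files that fit comfortably in RAM.

```python
import hashlib


def sha256_of(fs, path):
    """Return the SHA-256 hex digest of the bytes returned by fs.read_bytes()."""
    return hashlib.sha256(fs.read_bytes(path)).hexdigest()
```

The same bytes could be obtained with fs.cat_file(path) instead of fs.read_bytes(path).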
Listing all DVC-tracked files recursively
>>> fs.find("/", detail=False, dvc_only=True)
[
'/data/data.xml',
'/data/features/test.pkl',
'/data/features/train.pkl',
'/data/prepared/test.tsv',
'/data/prepared/train.tsv',
'/evaluation/importance.png',
'/model.pkl'
]
This is similar to the dvc ls --recursive --dvc-only CLI command. Note that "/" is considered the root of the Git repo. You can specify a sub-path to return only the entries in that directory. Similarly, fs.ls() provides a non-recursive listing.
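Since find() with detail=False returns a plain list of paths, the result can be filtered with ordinary Python. A minimal sketch, assuming fs is a DVCFileSystem instance; the helper name tracked_files_by_suffix and its parameters are hypothetical.

```python
def tracked_files_by_suffix(fs, root="/", suffix=".pkl"):
    """List DVC-tracked files under root whose paths end with suffix."""
    paths = fs.find(root, detail=False, dvc_only=True)
    return sorted(p for p in paths if p.endswith(suffix))
```

Passing a sub-path such as root="/data/features" restricts the search to that directory, as described above.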
Listing all files (including Git-tracked)
>>> fs.find("/", detail=False)
[
...
'/.gitignore',
'/README.md',
'/data/.gitignore',
'/data/data.xml',
'/data/features/test.pkl',
'/data/features/train.pkl',
'/data/prepared/test.tsv',
'/data/prepared/train.tsv',
...
'/evaluation/.gitignore',
'/evaluation/importance.png',
'/evaluation/plots/confusion_matrix.json',
'/evaluation/plots/precision_recall.json',
'/evaluation/plots/roc.json',
'/model.pkl',
...
]
This is similar to the dvc ls --recursive CLI command. It returns all of the files tracked by DVC and Git and, if the filesystem is opened locally, also includes local untracked files.
Downloading a file or a directory
>>> fs.get_file("data/data.xml", "data.xml")
This downloads the "data/data.xml" file to the current working directory as "data.xml". DVC-tracked files are taken from the cache if they exist there; otherwise they are streamed from the remote.
>>> fs.get("data", "data", recursive=True)
This downloads all the files in the "data" directory, whether Git-tracked or DVC-tracked, into a local directory "data". As with get_file(), DVC might fetch files from the remote if they don't exist in the cache.
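A download can be skipped when the target already exists locally. The sketch below wraps get_file() in such a check, assuming fs is a DVCFileSystem instance; the helper name fetch_if_missing is hypothetical, and it only checks for existence, not whether the local copy is up to date.

```python
import os


def fetch_if_missing(fs, remote_path, local_path):
    """Download remote_path via fs.get_file() unless local_path already exists.

    Returns True if a download happened and False if it was skipped.
    """
    if os.path.exists(local_path):
        return False
    parent = os.path.dirname(local_path)
    if parent:
        # Create the target directory up front so the download has a place to land.
        os.makedirs(parent, exist_ok=True)
    fs.get_file(remote_path, local_path)
    return True
```

For example, fetch_if_missing(fs, "data/data.xml", "downloads/data.xml") would download the file on the first call and return False on later calls.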
Using subrepos
If you have initialized DVC in a subdirectory of the Git repository, use DVCFileSystem(url, subrepos=True) to access the subdirectory.
>>> from dvc.api import DVCFileSystem
>>> url = "https://github.com/iterative/monorepo-example.git"
# by default, DVC initialized in a subdirectory will be ignored
>>> fs = DVCFileSystem(url, rev="develop")
>>> fs.find("nlp", detail=False, dvc_only=True)
[]
# use subrepos=True to list those files
>>> fs = DVCFileSystem(url, subrepos=True, rev="develop")
>>> fs.find("nlp", detail=False, dvc_only=True)
['nlp/data/data.xml', 'nlp/data/features/test.pkl', 'nlp/data/features/train.pkl', 'nlp/data/prepared/test.tsv', 'nlp/data/prepared/train.tsv', 'nlp/eval/importance.png', 'nlp/model.pkl']
fsspec API Reference
As DVCFileSystem is based on fsspec, it is compatible with most of the APIs that fsspec offers. When DVC is installed in the same Python environment as any other fsspec-compatible library (such as Hugging Face Datasets), DVCFileSystem will be used automatically when a dvc:// filesystem URL is provided to fsspec function calls. For more details, check out the fsspec API Reference.
Note that dvc:// URLs contain the path to the file you wish to load, relative to the root of the DVC project. dvc:// URLs should not contain a Git repository URL. The Git repository URL is provided separately via the url argument for DVCFileSystem.
When using dvc:// URLs, additional constructor arguments for DVCFileSystem (such as url or rev) should be passed via the storage_options dictionary or as keyword arguments, depending on the specific fsspec method being called. Please refer to the fsspec documentation for specific details.
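The split between the dvc:// URL and the constructor arguments can be sketched as a small helper that returns both pieces. The function name dvc_url_and_options is hypothetical; it only builds strings and a dictionary and does not import DVC or fsspec.

```python
def dvc_url_and_options(repo_url, path_in_repo, rev=None):
    """Split a repository location and an in-repo path into the two parts fsspec needs.

    Returns a (dvc_url, options) pair: the dvc:// URL carries only the path
    relative to the project root, while the Git repository URL (and optional
    revision) go into the options dictionary.
    """
    dvc_url = "dvc://" + path_in_repo.lstrip("/")
    options = {"url": repo_url}
    if rev is not None:
        options["rev"] = rev
    return dvc_url, options
```

The returned options can then be passed either as keyword arguments (for example fsspec.open(dvc_url, **options)) or as storage_options=options, depending on the fsspec method.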
fsspec API examples:
For methods which take filesystem arguments as additional keyword arguments:
>>> import fsspec
>>> fsspec.open(
... "dvc://workshop/satellite-data/jan_train.csv",
... url="https://github.com/iterative/dataset-registry.git",
... )
<OpenFile 'workshop/satellite-data/jan_train.csv'>
For methods which take filesystem arguments via the storage_options dictionary:
>>> import fsspec
>>> fsspec.get_fs_token_paths(
... "dvc://workshop/satellite-data/jan_train.csv",
... storage_options={"url": "https://github.com/iterative/dataset-registry.git"},
... )
(<dvc.fs.dvc._DVCFileSystem object at 0x113f7a290>, '06e54af48d3513bf33a8988c47e6fb47', ['workshop/satellite-data/jan_train.csv'])