import
Download a file or directory tracked by another DVC or Git repository into the
workspace, and track it (an import .dvc
file is created).
See also our
dvc.api.open()
Python API function.
Synopsis
usage: dvc import [-h] [-q | -v]
[-o <path>] [-f] [--rev <commit>]
[--no-exec | --no-download]
[-j <number>]
[--config <path>] [--remote <name>]
[--remote-config [<name>=<value> ...]]
url path
positional arguments:
url Location of DVC or Git repository to download from
path Path to a file or directory within the repository
Description
Provides an easy way to reuse files or directories tracked in any DVC
repository (e.g. datasets, intermediate results, ML models) or Git
repository (e.g. code, small image/other files). dvc import
downloads the
target file or directory (found at path
in url
), and tracks it in the local
project. This makes it possible to update the import later, if the data source
has changed (see dvc update
).
See
dvc list
for a way to browse repository contents to find files or directories to import.
Note that
dvc get
corresponds to the first step this command performs (just download the data).
The imported data is cached, and linked (or copied) to the current
working directory with its original file name e.g. data.txt
(or to a location
provided with --out
). An import .dvc
file is created in the same location
e.g. data.txt.dvc
– similar to using dvc add
after downloading the data.
DVC won't push data imported from other DVC repos to remote storage. dvc pull
will download from the original source.
The url
argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported (e.g.
[user@]server:project.git
). url
can also be a local file system path
(including the current project e.g. .
).
The path
argument specifies a file or directory to download (paths inside
tracked directories are supported). It should be relative to the root of the
repo (absolute paths are supported when url
is local). Note that DVC-tracked
targets must be found in a dvc.yaml
or .dvc
file of the repo.
See
dvc import-url
to download and track data from other supported locations such as S3, SSH, HTTP, etc.
.dvc
files support references to data in an external DVC repository (hosted on
a Git server). In such a .dvc
file, the deps
field specifies the url
and
data path
, and the outs
field contains the corresponding local path in the
workspace. It records enough metadata about the imported data to
enable DVC to efficiently determine whether the local copy is out of date.
To actually version the data, git add
(and git commit
) the import .dvc
file.
⚠️ Relevant notes and limitations:
- Source DVC repos should have a dvc remote default containing the target data for this command to work.
- The only exception to the above requirement is for local repos, where DVC will try to copy the data from its cache first.
- Limited support for chained imports is available (importing data that was itself imported into the source repo from another one).
- Note that dvc repro doesn't check or update import .dvc files (see dvc freeze). Use dvc update to bring the import up to date from the data source.
Options
- -o <path>, --out <path> - specify a path to the desired location in the workspace to place the downloaded file or directory (instead of using the current working directory).
- -f, --force - when using --out to specify a local target file or directory, the operation will fail if those paths already exist. This flag forces the operation, causing local files/dirs to be overwritten by the command.
- --rev <commit> - commit hash, branch or tag name, etc. (any Git revision) of the repository to download the file or directory from. The latest commit (in the default branch) is used by default when this option is not specified. Note that this adds a rev field in the import .dvc file that fixes it to the revision. This can impact the behavior of dvc update (see the Importing and updating fixed revisions example below).
- --no-exec - create the import .dvc file without accessing url (assumes that the data source is valid). This is useful if you need to define the project imports quickly, and import the data later (use dvc update to finish the operation(s)).
- --no-download - create the import .dvc file including the source data information (repository URL and version) but without downloading the associated data. This is useful if you need to track changes in remote data without using local storage space (yet). The data can be downloaded later using dvc pull. The file version can be updated using dvc update --no-download.
- -j <number>, --jobs <number> - parallelism level for DVC to download data from the remote. The default value is 4 * cpu_count(). Using more jobs may speed up the operation. Note that the default value can be set in the source repo using the jobs config option of dvc remote modify.
- --config <path> - path to a config file that will be merged with the config in the target repository.
- --remote <name> - name of the dvc remote to set as a default in the target repository.
- --remote-config [<name>=<value> ...] - dvc remote config options to merge with a remote's config (default or one specified by --remote) in the target repository.
- -h, --help - print the usage/help message, and exit.
- -q, --quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
- -v, --verbose - display detailed tracing information.
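When dvc import is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of DVC: build_import_command is a hypothetical helper, and it only uses flags listed in the synopsis above, with the positional url and path placed last.

```python
from typing import List, Optional


def build_import_command(
    url: str,
    path: str,
    out: Optional[str] = None,
    rev: Optional[str] = None,
    force: bool = False,
    no_download: bool = False,
    jobs: Optional[int] = None,
) -> List[str]:
    """Return an argv list for `dvc import` (hypothetical helper, not a DVC API)."""
    cmd = ["dvc", "import"]
    if out is not None:
        cmd += ["--out", out]
    if force:
        cmd.append("--force")
    if rev is not None:
        cmd += ["--rev", rev]
    if no_download:
        cmd.append("--no-download")
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    # Positional arguments come last, in the order given by the synopsis.
    cmd += [url, path]
    return cmd


cmd = build_import_command(
    "git@github.com:iterative/dataset-registry.git",
    "use-cases/cats-dogs",
    rev="cats-dogs-v1",
)
print(" ".join(cmd))
```

The resulting list could then be handed to subprocess.run(cmd, check=True), assuming the dvc executable is installed and on PATH.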
Examples
A simple case for this command is to import a dataset from an external DVC repository, such as our get started example repo.
$ dvc import git@github.com:iterative/example-get-started \
data/data.xml
Importing 'data/data.xml (git@github.com:iterative/example-get-started)'
-> 'data.xml'
In contrast with dvc get
, this command doesn't just download the data file,
but it also creates an import .dvc
file with a link to the data source (as
explained in the description above). (This .dvc
file can later be used to
update the import.) Check data.xml.dvc
:
md5: 7de90e7de7b432ad972095bc1f2ec0f8
frozen: true
wdir: .
deps:
- path: data/data.xml
repo:
url: git@github.com:iterative/example-get-started
rev_lock: 6c73875a5f5b522f90b5afa9ab12585f64327ca7
outs:
- md5: a304afb96060aad90176268345e10355
path: data.xml
cache: true
Several of the values above are pulled from the original .dvc
file
in the external DVC repository.
The url
and rev_lock
subfields under repo
are used to save the origin and
version of the dependency, respectively.
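To make the roles of these fields concrete, here is a small sketch that summarizes an import .dvc file once it has been parsed. It assumes the file was already loaded into a plain dictionary (for example by a YAML parser); describe_import and its output keys are hypothetical names chosen for illustration, not part of DVC.

```python
from typing import Any, Dict

# The import .dvc file shown above, as a YAML parser would typically load it.
dvcfile: Dict[str, Any] = {
    "frozen": True,
    "deps": [
        {
            "path": "data/data.xml",
            "repo": {
                "url": "git@github.com:iterative/example-get-started",
                "rev_lock": "6c73875a5f5b522f90b5afa9ab12585f64327ca7",
            },
        }
    ],
    "outs": [{"md5": "a304afb96060aad90176268345e10355", "path": "data.xml"}],
}


def describe_import(dvcfile: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize where an import came from (hypothetical helper)."""
    dep = dvcfile["deps"][0]
    repo = dep["repo"]
    out = dvcfile["outs"][0]
    return {
        "source_url": repo["url"],           # origin of the dependency
        "source_path": dep["path"],          # path inside the source repo
        "locked_commit": repo["rev_lock"],   # exact version that was imported
        "requested_rev": repo.get("rev"),    # only present when --rev was used
        "local_path": out["path"],           # where the data lives locally
    }


info = describe_import(dvcfile)
print(info["source_url"], "->", info["local_path"])
```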
Example: Importing and updating fixed revisions
To import a specific version of a file/directory, we may use the --rev
option:
$ dvc import --rev cats-dogs-v1 \
git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogs
Importing
'use-cases/cats-dogs (git@github.com:iterative/dataset-registry.git)'
-> 'cats-dogs'
When using this option, the import .dvc
file will also have a rev
subfield
under repo
:
deps:
- path: use-cases/cats-dogs
repo:
url: git@github.com:iterative/dataset-registry.git
rev: cats-dogs-v1
rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
If rev
is a Git branch or tag (where the underlying commit changes), the data
source may have updates at a later time. To bring it up to date if so (and
update rev_lock
in the .dvc
file), simply use dvc update <stage>.dvc
. If
rev
is a specific commit hash (does not change), dvc update
without options
will not have an effect on the import .dvc
file. You may force-update it to a
different commit with dvc update --rev
:
$ dvc update --rev cats-dogs-v2 cats-dogs.dvc
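In the above example, the value for rev in the updated .dvc file will be cats-dogs-v2, the revision passed to --rev. If that revision is a branch or a tag whose underlying commit changes, the import will be able to update normally going forward.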
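The rules described in this example can be sketched as a small classifier over the repo field of an import .dvc file. This is only an illustration under a loud assumption: a rev counts as a commit hash only if it is a full 40-character hexadecimal string, so abbreviated hashes are not detected. The function name and the returned labels are hypothetical.

```python
import re
from typing import Any, Dict

# Assumption: only a full 40-character hex string is treated as a commit hash.
_FULL_SHA = re.compile(r"[0-9a-f]{40}")


def update_behavior(repo: Dict[str, Any]) -> str:
    """Guess how a plain `dvc update` relates to this import (heuristic sketch)."""
    rev = repo.get("rev")
    if rev is None:
        # No --rev was given: the default branch of the source is used.
        return "follows default branch"
    if _FULL_SHA.fullmatch(rev):
        # A fixed commit does not change, so a plain update has no effect.
        return "pinned to commit"
    # Branches and tags may point to a newer commit later on.
    return "follows branch or tag"


print(update_behavior({"rev": "cats-dogs-v1"}))
```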
Example: Data registry
If you take a look at our
dataset registry
project, you'll see that it's organized into different directories
such as tutorials/versioning
and use-cases/
, and these contain .dvc
files
that track different datasets. Given this simple structure, its data files can
be easily shared among several other projects using dvc get
and dvc import
.
For example:
$ dvc get https://github.com/iterative/dataset-registry \
tutorials/versioning/data.zip
Used in our versioning tutorial
Or
$ dvc import git@github.com:iterative/dataset-registry.git \
use-cases/cats-dogs
dvc import
provides a better way to incorporate data files tracked in external
DVC repositories because it saves the connection between the
current project and the source repo. This means that enough information is
recorded in an import .dvc
file in order to
reproduce downloading of this same data version
in the future, where and when needed. This is achieved with the repo
field,
for example (matching the import command above):
frozen: true
deps:
- path: use-cases/cats-dogs
repo:
url: git@github.com:iterative/dataset-registry.git
rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
outs:
- md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
path: cats-dogs
cache: true
See a full explanation in our Data Registry use case.
Example: Importing from any Git repository
You can even import files from plain Git repos that are not DVC repositories. For example, let's import a dataset from GSA's data repo:
$ dvc import git@github.com:GSA/data \
enterprise-architecture/it-standards.csv
Importing ...
Note that Git-tracked files can be imported from DVC repos as well.
The file is imported, and along with it, an import .dvc
file is created. Check
it-standards.csv.dvc
:
deps:
- path: enterprise-architecture/it-standards.csv
repo:
url: git@github.com:GSA/data
rev_lock: af6a1feb542dc05b4d3e9c80deb50e6596876e5f
outs:
- md5: 7e6de779a1ab286745c808f291d2d671
path: it-standards.csv
The url
and rev_lock
subfields under repo
are used to save the origin and
version of the dependency, respectively.
Example: Chained imports
DVC supports importing data that was itself imported into the source repo, as
long as all the repos in the import chain (and their dvc remote default
) are
accessible from the final destination repo.
Consider an example with 3 DVC repos (A, B, and C). DVC repo /repo/a
contains
a data.csv
file tracked with dvc add
:
/repo/a
├── data.csv
└── data.csv.dvc
In repo B, we import data.csv
from A and into a subdirectory:
$ dvc import /repo/a data.csv --out training/data.csv
Project B may of course contain other files unique to itself, for example:
/repo/b
└── training
├── data.csv
├── data.csv.dvc
├── labels
│ ├── test.txt
│ └── truth.txt
└── labels.dvc
Notice that the
training/labels
directory (not an import) is also tracked in B separately.
If we examine training/data.csv.dvc
, we can see that the import source is
repo A (/repo/a
):
deps:
- path: data.csv
repo:
url: /repo/a
rev_lock: 32ab3ddc8a0b5cbf7ed8cb252f93915a34b130eb
outs:
- md5: acbd18db4cc2f85cedef654fccc4a4d8
size: 3234523
path: data.csv
Now let's imagine that we run the following command in our third repo, C:
$ dvc import /repo/b training
This will result in the following directory structure, which contains a chained import and a regular one:
/repo/c
├── training
│ ├── data.csv
│ └── labels
│ ├── test.txt
│ └── truth.txt
└── training.dvc
- training/data.csv is imported from A into B into C
- training/labels/ is imported from B into C directly
However, training.dvc
only references repo B (/repo/b
):
deps:
- path: training
repo:
url: /repo/b
rev_lock: 15136ed84b59468b68fd66b8141b41c5be682ced
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: training
Each time that we dvc import or dvc update training/ into C (or even dvc pull it), DVC will first look up the contents of training in B and notice that training/data.csv is itself an import. It will then resolve the chain as needed (finding data.csv in A).
Note that when running dvc update training from repo C, DVC will only check whether or not training/ changed in repo B. So if data.csv has only changed in A, training/data.csv won't be updated in C until dvc update training/data.csv has been run in B.
This means both repos A and B must be reachable when dvc import
runs in repo
C, otherwise the import chain resolution would fail.
The dvc remote default
for all repos in the import chain must also be
accessible (repo C needs to have all the appropriate credentials).
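The chain resolution described in this example can be pictured with a toy model. The sketch below is not DVC's implementation: it represents the A, B, C scenario as a lookup table of which tracked paths are themselves imports, then follows the links to list every repo that has to be reachable. All names here (Location, IMPORTS, resolve_chain) are hypothetical.

```python
from typing import Dict, List, Tuple

Location = Tuple[str, str]  # (repo URL, path inside that repo)

# Toy model of the example: paths in a repo that are themselves imports.
IMPORTS: Dict[Location, Location] = {
    ("/repo/b", "training/data.csv"): ("/repo/a", "data.csv"),
}


def resolve_chain(start: Location, imports: Dict[Location, Location]) -> List[Location]:
    """Follow import links from `start` until reaching data that is not an import."""
    chain = [start]
    while chain[-1] in imports:
        source = imports[chain[-1]]
        if source in chain:
            raise ValueError(f"circular import chain at {source}")
        chain.append(source)
    return chain


chain = resolve_chain(("/repo/b", "training/data.csv"), IMPORTS)
repos_needed = [repo for repo, _ in chain]
print(repos_needed)
```

In this model, training/labels resolves to a chain of length one (repo B only), which matches the distinction between the chained import and the regular one.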
Example: Set default remote
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --remote myremote
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
remote: myremote
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: prepared
Example: Set AWS profile for default remote
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --remote-config profile=myprofile
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
remote:
profile: myprofile
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: prepared
Example: Create new AWS S3 remote and set it as default
If a remote with that name already exists, its config will be merged with options
provided by --remote-config
.
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared \
--remote myremote \
--remote-config url=s3://mybucket/mypath profile=myprofile
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
config:
core:
remote: myremote
remote:
myremote:
url: s3://mybucket/mypath
profile: myprofile
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: prepared
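Taken together, the three remote-related examples above show three different shapes under repo in the resulting .dvc file. The sketch below simply mirrors those three outputs; it is an illustration inferred from the examples, not DVC's actual logic, and parse_pairs and repo_remote_fields are hypothetical helpers.

```python
from typing import Any, Dict, List, Optional


def parse_pairs(pairs: List[str]) -> Dict[str, str]:
    """Turn ["name=value", ...] into a dict, splitting on the first "="."""
    result: Dict[str, str] = {}
    for pair in pairs:
        name, sep, value = pair.partition("=")
        if not sep or not name:
            raise ValueError(f"expected <name>=<value>, got {pair!r}")
        result[name] = value
    return result


def repo_remote_fields(
    remote: Optional[str] = None,
    remote_config: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Mirror the extra `repo` subfields seen in the three examples (sketch only)."""
    options = parse_pairs(remote_config or [])
    if remote and options:
        # --remote together with --remote-config: a full config mapping.
        return {"config": {"core": {"remote": remote}, "remote": {remote: options}}}
    if remote:
        # --remote alone: just the remote name.
        return {"remote": remote}
    if options:
        # --remote-config alone: options for the default remote.
        return {"remote": options}
    return {}


print(repo_remote_fields("myremote", ["url=s3://mybucket/mypath", "profile=myprofile"]))
```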
Example: Set AWS secret keys for particular remote
In this example, instead of using --remote myremote
with --remote-config
and
exposing your secrets in the .dvc file, you could use --config
to use a gitignored
config file. The format of the config file is the same as produced by
dvc config
.
$ cat myconfig
[core]
remote = myremote
[remote "myremote"]
access_key_id = myaccesskeyid
secret_access_key = mysecretaccesskey
$ cat .gitignore # make sure you are not committing this file to Git
...
/myconfig
...
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --config myconfig
...
$ cat prepared.dvc
deps:
- path: data/prepared
repo:
url: https://github.com/iterative/example-get-started-s3
rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
config: myconfig
outs:
- md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
size: 27
nfiles: 3
path: prepared