Download a file or directory tracked by another DVC or Git repository into the
workspace, and track it (an import
.dvc file is created).
See also our
dvc.api.open() Python API function.
usage: dvc import [-h] [-q | -v] [-o <path>] [-f] [--rev <commit>]
                  [--no-exec | --no-download] [-j <number>]
                  [--config <path>] [--remote <name>]
                  [--remote-config [<name>=<value> ...]]
                  url path

positional arguments:
  url    Location of DVC or Git repository to download from
  path   Path to a file or directory within the repository
Provides an easy way to reuse files or directories tracked in any DVC
repository (e.g. datasets, intermediate results, ML models) or Git
repository (e.g. code, small image/other files).
dvc import downloads the
target file or directory (found at
url), and tracks it in the local
project. This makes it possible to update the import later, if the data source
has changed (see dvc update).
See dvc list for a way to browse repository contents to find files or directories to import.
dvc get corresponds to the first step this command performs (just download the data).
The imported data is cached, and linked (or copied) to the current
working directory with its original file name, e.g. data.txt (or to a location
specified with --out). An import .dvc file is created in the same location, e.g.
data.txt.dvc, similar to using
dvc add after downloading the data.
The url argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported.
url can also be a local file system path (including the current project).
The path argument specifies a file or directory to download (paths inside
tracked directories are supported). It should be relative to the root of the
repo (absolute paths are supported when
url is local). Note that DVC-tracked
targets must be found in a
.dvc file of the repo.
See dvc import-url to download and track data from other supported locations such as S3, SSH, HTTP, etc.
.dvc files support references to data in an external DVC repository (hosted on
a Git server). In such a
.dvc file, the
deps field specifies the
path, and the
outs field contains the corresponding local path in the
workspace. It records enough metadata about the imported data to
enable DVC to efficiently determine whether the local copy is out of date.
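The fields described above can be read programmatically once the file has been parsed. The following is a minimal sketch, assuming the import .dvc file has already been loaded into a plain dict by a YAML parser. The helper name summarize_import is hypothetical (it is not part of DVC), and the example values mirror an import of data/data.xml from the example-get-started repository.

```python
from typing import Any, Dict, Optional


def summarize_import(dvc_file: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Collect the source and local locations recorded in an import .dvc file.

    `dvc_file` is the already-parsed YAML content (assumed to be a plain dict).
    """
    dep = dvc_file["deps"][0]  # where the data came from
    out = dvc_file["outs"][0]  # where the data lives in the workspace
    repo = dep["repo"]
    return {
        "source_repo": repo["url"],
        "source_path": dep["path"],
        "requested_rev": repo.get("rev"),  # only recorded when --rev was used
        "locked_rev": repo.get("rev_lock"),
        "local_path": out["path"],
    }


# Hypothetical parsed content of data.xml.dvc
example = {
    "frozen": True,
    "deps": [
        {
            "path": "data/data.xml",
            "repo": {
                "url": "git@github.com:iterative/example-get-started",
                "rev_lock": "6c73875a5f5b522f90b5afa9ab12585f64327ca7",
            },
        }
    ],
    "outs": [{"md5": "a304afb96060aad90176268345e10355", "path": "data.xml"}],
}

summary = summarize_import(example)
print(summary)
```

Keeping the lookup in one small function makes it easy to compare locked_rev against the current commit of the source repository in a script, which is the kind of check the recorded metadata is meant to support.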
⚠️ Relevant notes and limitations:
- Source DVC repos should have a default remote (see dvc remote default) containing the target data for this command to work.
- The only exception to the above requirement is for local repos, where DVC will try to copy the data from its cache first.
- Limited support for chained imports is available (importing data that was itself imported into the source repo from another one).
- Note that dvc repro doesn't check or update import .dvc files (see dvc freeze); use dvc update to bring the import up to date from the data source.
--out <path> - specify a path to the desired location in the workspace to place the downloaded file or directory (instead of using the current working directory).
--force - when using --out to specify a local target file or directory, the operation will fail if those paths already exist. This flag forces the operation, causing local files/dirs to be overwritten by the command.
--rev <commit> - commit hash, branch or tag name, etc. (any Git revision) of the repository to download the file or directory from. The latest commit (in the default branch) is used by default when this option is not specified.
--no-exec - create the import .dvc file without accessing url (assumes that the data source is valid). This is useful if you need to define the project imports quickly, and import the data later (use dvc update to finish the operation(s)).
--no-download - create the import .dvc file including the source data information (repository URL and version) but without downloading the associated data. This is useful if you need to track changes in remote data without using local storage space (yet). The data can be downloaded later using dvc pull. The file version can be updated using dvc update --no-download.
--jobs <number> - parallelism level for DVC to download data from the remote. The default value is 4 * cpu_count(). Using more jobs may speed up the operation. Note that the default value can be set in the source repo using the jobs config option of dvc remote modify.
--config <path> - path to a config file that will be merged with the config in the target repository.
--remote <name> - name of the dvc remote to set as a default in the target repository.
--remote-config [<name>=<value> ...] - dvc remote config options to merge with a remote's config (default or one specified by --remote) in the target repository.
--help - print the usage/help message, and exit.
--quiet - do not write anything to standard output. Exit with 0 if no problems arise, otherwise 1.
--verbose - display detailed tracing information.
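When imports are scripted, the options above can be assembled into an argument list before calling the CLI. This is only a sketch of one possible approach: build_import_cmd is a hypothetical helper (not part of DVC), it covers just a subset of the options, and it rejects combining --no-exec with --no-download because the usage string lists them as alternatives.

```python
import shlex
from typing import List, Optional


def build_import_cmd(
    url: str,
    path: str,
    out: Optional[str] = None,
    rev: Optional[str] = None,
    no_exec: bool = False,
    no_download: bool = False,
    jobs: Optional[int] = None,
) -> List[str]:
    """Assemble a `dvc import` argument list from a subset of its options."""
    if no_exec and no_download:
        # The usage string shows these as alternatives: [--no-exec | --no-download]
        raise ValueError("--no-exec and --no-download cannot be combined")
    cmd = ["dvc", "import"]
    if out is not None:
        cmd += ["--out", out]
    if rev is not None:
        cmd += ["--rev", rev]
    if no_exec:
        cmd.append("--no-exec")
    if no_download:
        cmd.append("--no-download")
    if jobs is not None:
        cmd += ["--jobs", str(jobs)]
    return cmd + [url, path]


cmd = build_import_cmd(
    "https://github.com/iterative/dataset-registry",
    "use-cases/cats-dogs",
    rev="cats-dogs-v1",
    no_download=True,
)
print(shlex.join(cmd))
# To actually run the command: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run (rather than a single shell string) avoids quoting problems when paths contain spaces.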
A simple case for this command is to import a dataset from an external DVC repository, such as our get started example repo.
$ dvc import git@github.com:iterative/example-get-started \
    data/data.xml
Importing 'data/data.xml (git@github.com:iterative/example-get-started)' -> 'data.xml'
In contrast with
dvc get, this command doesn't just download the data file,
but it also creates an import
.dvc file with a link to the data source (as
explained in the description above). (This
.dvc file can later be used to
update the import.) Check data.xml.dvc:
md5: 7de90e7de7b432ad972095bc1f2ec0f8
frozen: true
wdir: .
deps:
  - path: data/data.xml
    repo:
      url: git@github.com:iterative/example-get-started
      rev_lock: 6c73875a5f5b522f90b5afa9ab12585f64327ca7
outs:
  - md5: a304afb96060aad90176268345e10355
    path: data.xml
    cache: true
Several of the values above are pulled from the original .dvc file
in the external DVC repository.
The url and rev_lock subfields under
repo are used to save the origin and
version of the dependency, respectively.
To import a specific version of a file/directory, we may use the --rev option:
$ dvc import --rev cats-dogs-v1 \
    git@github.com:iterative/dataset-registry.git \
    use-cases/cats-dogs
Importing 'use-cases/cats-dogs (git@github.com:iterative/dataset-registry.git)' -> 'cats-dogs'
When using this option, the import
.dvc file will also have a rev subfield under repo:
deps:
  - path: use-cases/cats-dogs
    repo:
      url: git@github.com:iterative/dataset-registry.git
      rev: cats-dogs-v1
      rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
If rev is a Git branch or tag (where the underlying commit changes), the data
source may have updates at a later time. To bring it up to date if so (and
update rev_lock in the
.dvc file), simply use
dvc update <stage>.dvc. If
rev is a specific commit hash (does not change),
dvc update without options
will not have an effect on the import
.dvc file. You may force-update it to a
different commit with
dvc update --rev:
$ dvc update --rev cats-dogs-v2 cats-dogs.dvc
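In the above example, the value for rev in the new .dvc file will be cats-dogs-v2 (the revision given to --rev). If that revision is a branch, or a tag whose underlying commit can change, the import will be able to update normally going forward.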
If you take a look at our dataset-registry
project, you'll see that it's organized into different directories such as
tutorials/versioning and use-cases/, and these contain .dvc files
that track different datasets. Given this simple structure, its data files can
be easily shared among several other projects using
dvc get and dvc import. For example:
$ dvc get https://github.com/iterative/dataset-registry \
    tutorials/versioning/data.zip
Used in our versioning tutorial
$ dvc import git@github.com:iterative/dataset-registry.git \
    use-cases/cats-dogs
dvc import provides a better way to incorporate data files tracked in external
DVC repositories because it saves the connection between the
current project and the source repo. This means that enough information is
recorded in an import
.dvc file in order to
reproduce downloading of this same data version
in the future, where and when needed. This is achieved with the repo field,
for example (matching the import command above):
frozen: true
deps:
  - path: use-cases/cats-dogs
    repo:
      url: git@github.com:iterative/dataset-registry.git
      rev_lock: 0547f5883fb18e523e35578e2f0d19648c8f2d5c
outs:
  - md5: b6923e1e4ad16ea1a7e2b328842d56a2.dir
    path: cats-dogs
    cache: true
See a full explanation in our Data Registry use case.
You can even import files from plain Git repos that are not DVC repositories. For example, let's import a dataset from GSA's data repo:
$ dvc import git@github.com:GSA/data \
    enterprise-architecture/it-standards.csv
Importing ...
Note that Git-tracked files can be imported from DVC repos as well.
The file is imported, and along with it, an import
.dvc file is created. Check it-standards.csv.dvc:
deps:
  - path: enterprise-architecture/it-standards.csv
    repo:
      url: git@github.com:GSA/data
      rev_lock: af6a1feb542dc05b4d3e9c80deb50e6596876e5f
outs:
  - md5: 7e6de779a1ab286745c808f291d2d671
    path: it-standards.csv
The url and rev_lock subfields under
repo are used to save the origin and
version of the dependency, respectively.
DVC supports importing data that was itself imported into the source repo, as
long as all the repos in the import chain (and their
dvc remote default) are
accessible from the final destination repo.
Consider an example with 3 DVC repos (A, B, and C). DVC repo A contains a
data.csv file tracked with dvc add:
/repo/a
├── data.csv
└── data.csv.dvc
In repo B, we import
data.csv from A into a subdirectory:
$ dvc import /repo/a data.csv --out training/data.csv
Project B may of course contain other files unique to itself, for example:
/repo/b
└── training
    ├── data.csv
    ├── data.csv.dvc
    ├── labels
    │   ├── test.txt
    │   └── truth.txt
    └── labels.dvc
Notice that the training/labels directory (not an import) is also tracked in B separately.
If we examine training/data.csv.dvc, we can see that the import source is
repo A (/repo/a):
deps:
  - path: data.csv
    repo:
      url: /repo/a
      rev_lock: 32ab3ddc8a0b5cbf7ed8cb252f93915a34b130eb
outs:
  - md5: acbd18db4cc2f85cedef654fccc4a4d8
    size: 3234523
    path: data.csv
Now let's imagine that we run the following command in our third repo, C:
$ dvc import /repo/b training
This will result in the following directory structure, which contains a chained import and a regular one:
/repo/c
├── training
│   ├── data.csv
│   └── labels
│       ├── test.txt
│       └── truth.txt
└── training.dvc
- training/data.csv is imported from A into B into C
- training/labels/ is imported from B into C directly
training.dvc only references repo B (/repo/b):
deps:
  - path: training
    repo:
      url: /repo/b
      rev_lock: 15136ed84b59468b68fd66b8141b41c5be682ced
outs:
  - md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
    size: 27
    nfiles: 3
    path: training
Each time that we dvc import or dvc update*
training/ into C (or even
dvc pull it), DVC will first look up the contents of
training in B and notice that
training/data.csv is itself an import. It will then resolve the chain as
needed (looking up data.csv in A).
*Note that when running dvc update training from repo C, DVC will only check whether or not training/ changed in repo B. So if data.csv has only changed in A, training/data.csv won't be updated in C until dvc update training/data.csv has been run in B.
This means both repos A and B must be reachable when
dvc import runs in repo
C, otherwise the import chain resolution would fail.
The dvc remote default for all repos in the import chain must also be
accessible (repo C needs to have all the appropriate credentials).
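The chain resolution described above can be pictured as following import records from one repository back to the next. The sketch below is a simplified conceptual model, not DVC's implementation: the IMPORTED_FROM table, resolve_chain, and repos_that_must_be_reachable are all hypothetical, and records are kept per path for illustration (in the example, repo C actually records a single import of the whole training directory).

```python
from typing import Dict, List, Tuple

Location = Tuple[str, str]  # (repository, path inside that repository)

# Hypothetical import records for the A/B/C example: each key was imported from its value.
IMPORTED_FROM: Dict[Location, Location] = {
    ("/repo/c", "training/data.csv"): ("/repo/b", "training/data.csv"),
    ("/repo/b", "training/data.csv"): ("/repo/a", "data.csv"),
    ("/repo/c", "training/labels"): ("/repo/b", "training/labels"),
}


def resolve_chain(start: Location) -> List[Location]:
    """Follow import records from `start` back to the original data source."""
    chain = [start]
    while chain[-1] in IMPORTED_FROM:
        source = IMPORTED_FROM[chain[-1]]
        if source in chain:
            raise ValueError(f"import cycle detected at {source}")
        chain.append(source)
    return chain


def repos_that_must_be_reachable(start: Location) -> List[str]:
    """Repositories, other than the starting one, that appear in the chain."""
    return [repo for repo, _ in resolve_chain(start)[1:]]


print(resolve_chain(("/repo/c", "training/data.csv")))
print(repos_that_must_be_reachable(("/repo/c", "training/data.csv")))
print(repos_that_must_be_reachable(("/repo/c", "training/labels")))
```

In this model the chained file depends on both B and A, while the directly imported labels directory depends on B only, which matches the accessibility requirement stated above.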
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --remote myremote
...
$ cat prepared.dvc
deps:
  - path: data/prepared
    repo:
      url: https://github.com/iterative/example-get-started-s3
      rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
      remote: myremote
outs:
  - md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
    size: 27
    nfiles: 3
    path: prepared
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --remote-config profile=myprofile
...
$ cat prepared.dvc
deps:
  - path: data/prepared
    repo:
      url: https://github.com/iterative/example-get-started-s3
      rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
      remote:
        profile: myprofile
outs:
  - md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
    size: 27
    nfiles: 3
    path: prepared
If a remote with that name already exists, its config will be merged with the options specified by --remote-config:
$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared \
    --remote myremote \
    --remote-config url=s3://mybucket/mypath profile=myprofile
...
$ cat prepared.dvc
deps:
  - path: data/prepared
    repo:
      url: https://github.com/iterative/example-get-started-s3
      rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
      config:
        core:
          remote: myremote
        remote:
          myremote:
            url: s3://mybucket/mypath
            profile: myprofile
outs:
  - md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
    size: 27
    nfiles: 3
    path: prepared
In this example, instead of using
--remote myremote with --remote-config and
exposing your secrets in the .dvc file, you could use
--config to use a gitignored
config file. The format of the config file is the same as produced by dvc config:
$ cat myconfig
[core]
    remote = myremote
[remote "myremote"]
    access_key_id = myaccesskeyid
    secret_access_key = mysecretaccesskey

$ cat .gitignore
# make sure you are not committing this file to git
...
/myconfig
...

$ dvc import https://github.com/iterative/example-get-started-s3 data/prepared --config myconfig
...
$ cat prepared.dvc
deps:
  - path: data/prepared
    repo:
      url: https://github.com/iterative/example-get-started-s3
      rev_lock: 8141b41c5be682ced15136ed84b59468b68fd66b
      config: myconfig
outs:
  - md5: e784c380dd9aa9cb13fbe22e62d7b2de.dir
    size: 27
    nfiles: 3
    path: prepared