From DVC to lakeFS: Zero-Copy Imports of Your Tracked Data

dvc-to-lakefs brings your DVC-tracked data into lakeFS with a zero-copy import — no re-uploading, no second copy. Get branches, commits, and merges at data-lake scale.

Itai Admi, Saugat Pachhai
8 minute read

Today we’re releasing dvc-to-lakefs, a small tool that brings your DVC-tracked data into lakeFS with a zero-copy import. No re-uploading, no second copy of your data: just your existing objects, now versioned with branches, commits, and merges at data-lake scale.

If you’ve been using DVC by lakeFS, there’s a good chance you already know we’re on the same team now: lakeFS acquired DVC last November. DVC is excellent at what it was built for: lightweight, Git-native versioning for a data scientist working on a single project with small datasets that comfortably round-trip through a local cache. It starts in seconds and needs no server.

But projects grow. At some point you’re tracking tens of thousands of files, or datasets too big to pull onto a laptop, or you’ve got several people and pipelines hitting the same data at once, or you want CI/CD that gates on data quality before anything reaches production. That’s the world lakeFS is built for: scalable data version control as shared infrastructure, sitting directly on top of your object storage.

The friction has always been getting from one to the other. dvc-to-lakefs removes it.

pip install dvc-to-lakefs

When It’s Time to Move Up

It’s worth being precise about where the line is, because it’s not about one tool being “better.”

DVC keeps your data in cloud storage and your version info in Git, and moves bytes through a local cache with dvc add, dvc push, and dvc pull. That model is a great fit when the data fits the laptop-and-cache rhythm.

lakeFS takes over when data versioning stops being a personal workflow and becomes something a team and its pipelines depend on. A few places where that shows up:

  • Scale. lakeFS is built to version object storage directly (tens of thousands to billions of objects) without staging everything through a local cache first.
  • Isolation without duplication. A lakeFS branch behaves like a full, isolated copy of your data, yet it is created instantly and at zero storage cost, because it references existing objects rather than copying them. Every experiment, backfill, or pipeline run can get its own branch off production.
  • Atomicity and consistency. Commits and merges are atomic across many files at once, so downstream consumers never see a half-written dataset, and related datasets stay consistent with each other.
  • Collaboration and CI/CD. Multiple people and jobs can read and write against branches concurrently, then open a pull request to review the changes and merge them once they’re validated. Pre-merge hooks can run data-quality checks so bad data never lands on your main branch.
  • Ecosystem. lakeFS plugs into Spark, Databricks, Airflow, Iceberg and many other tools and frameworks, and can be mounted as a filesystem for tools that expect local paths.
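
To make the "references existing objects rather than copying them" idea concrete, here is a toy model of branches as pointers to commits. It is only an illustration of the concept: the class name ToyRepo, its methods, and the example object addresses are hypothetical and say nothing about how lakeFS is actually implemented.

```python
class ToyRepo:
    """Toy model: a branch is a pointer to a commit, a commit maps paths to object addresses."""

    def __init__(self):
        self.commits = {"c0": {}}        # commit id -> {logical path: object address}
        self.branches = {"main": "c0"}   # branch name -> commit id

    def create_branch(self, name, source):
        # Only a pointer is copied; no object data is read or written.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        # A new commit reuses every unchanged address from its parent.
        parent = self.commits[self.branches[branch]]
        new_id = f"c{len(self.commits)}"
        self.commits[new_id] = {**parent, **changes}
        self.branches[branch] = new_id
        return new_id

    def listing(self, branch):
        return self.commits[self.branches[branch]]


repo = ToyRepo()
repo.commit("main", {"data/a.csv": "s3://bucket/obj1"})
repo.create_branch("experiment", "main")
repo.commit("experiment", {"data/a.csv": "s3://bucket/obj2"})
```

After the last line, main still lists the original object while experiment lists the new one, and the untouched data was never duplicated.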

In short: DVC for the single-project, fits-on-disk sweet spot; lakeFS when data version control becomes infrastructure the whole team leans on.

Setting Up lakeFS

If you’re not already running lakeFS, the quickest path is lakeFS Cloud, or you can self-host with a single Docker container. The quickstart walks through both.

There’s one requirement that matters more than any other for this import to work, so it’s worth calling out up front: create your lakeFS repository on the same blockstore as your DVC remote. If your DVC data lives in an S3 bucket, point your lakeFS repository at that same bucket and region. This is exactly what makes the import zero-copy: lakeFS references the very objects DVC already pushed, instead of pulling and re-writing them. If the two don’t match, the tool will stop and tell you rather than silently copying data around.
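
The same-blockstore requirement can be sketched as a comparison of two storage URIs. This is a minimal illustration, not the tool’s actual check: the function name same_blockstore and the rule "same scheme and same bucket" are assumptions, and the URIs in the usage lines are hypothetical.

```python
from urllib.parse import urlparse


def same_blockstore(dvc_remote_url: str, storage_namespace: str) -> bool:
    """Return True when both URIs share a scheme and a bucket name.

    Assumption: "same blockstore" is approximated as an identical scheme
    (for example s3 or gs) and an identical bucket. The real tool may apply
    a different or stricter rule, and Azure URLs would need extra handling.
    """
    remote = urlparse(dvc_remote_url)
    namespace = urlparse(storage_namespace)
    return (remote.scheme, remote.netloc) == (namespace.scheme, namespace.netloc)


# Hypothetical values for illustration only.
matches = same_blockstore("s3://my-bucket/dvc-store", "s3://my-bucket/lakefs/my-repo")
differs = same_blockstore("s3://my-bucket/dvc-store", "s3://other-bucket/lakefs/my-repo")
```

Running a check like this before the import is optional, since the tool itself stops on a mismatch, but it can be handy in a setup script.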

For credentials, the tool reads your lakeFS config from ~/.lakectl.yaml by default (or wherever LAKECTL_CONFIG_FILE points).
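
The lookup order described above can be sketched in a few lines. The helper name lakectl_config_path is hypothetical, and the sketch only resolves the path; it does not read or validate the file, and the tool’s own resolution logic may differ in detail.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def lakectl_config_path(env: Optional[Mapping[str, str]] = None) -> Path:
    """Resolve the config file: LAKECTL_CONFIG_FILE if set, else ~/.lakectl.yaml."""
    env = os.environ if env is None else env
    override = env.get("LAKECTL_CONFIG_FILE")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".lakectl.yaml"
```

Passing the environment as an argument keeps the function easy to test without touching the real process environment.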

Migrating With dvc-to-lakefs

Before you run anything, make sure you have:

  • Python 3.10+
  • a Git-backed DVC repo (a dvc init --no-scm repo won’t work)
  • a configured DVC remote with the data already pushed (dvc push)
  • a running lakeFS instance with a repository on the same blockstore as that remote
  • lakeFS credentials configured, and read access to the remote storage from both DVC and the lakeFS server
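
The first two items on this checklist can be verified locally before running anything. The sketch below is a hedged example: preflight_problems is a hypothetical helper, it assumes the DVC repo sits at the root of the Git repo, and it does not check the remote, the lakeFS instance, or credentials.

```python
import sys
from pathlib import Path


def preflight_problems(repo_dir: str, python_version: tuple = sys.version_info[:2]) -> list:
    """Return human-readable problems; an empty list means the basic checks passed."""
    root = Path(repo_dir)
    problems = []
    if tuple(python_version) < (3, 10):
        problems.append("Python 3.10+ is required")
    if not (root / ".dvc").is_dir():
        problems.append("not a DVC repo (no .dvc directory)")
    # A repo created with `dvc init --no-scm` has .dvc but no Git metadata.
    if not (root / ".git").exists():
        problems.append("DVC repo is not Git-backed (no .git found)")
    return problems
```

Calling preflight_problems("./my-dvc-repo") and printing the result gives a quick go/no-go signal before the import.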

After installing, the whole migration is one command:

lakectl import-from-dvc ./my-dvc-repo lakefs://my-repo

Here’s what happens under the hood. The tool reads the HEAD of your current Git branch and imports every tracked DVC output into a lakeFS branch of the same name, creating that branch from the repository’s default branch if it doesn’t exist yet. It then makes a commit on that branch containing the imported files. The commit message matches your Git commit message, and the commit carries a git_sha metadata field pointing at the Git revision it came from, so every lakeFS commit traces straight back to the code and config that produced it.
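
The mapping from Git state to a lakeFS commit can be sketched as follows. This is an illustration of the behavior described above, not the tool’s code: ImportPlan, read_git_head, and plan_commit are hypothetical names, and only standard git commands are used to read the branch, SHA, and message.

```python
import subprocess
from dataclasses import dataclass, field


@dataclass
class ImportPlan:
    branch: str                      # lakeFS branch, named after the Git branch
    message: str                     # commit message copied from Git
    metadata: dict = field(default_factory=dict)


def _git(repo_dir: str, *args: str) -> str:
    result = subprocess.run(
        ["git", "-C", repo_dir, *args], check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def read_git_head(repo_dir: str) -> tuple:
    """Return (branch, sha, message) for the HEAD of the current Git branch."""
    branch = _git(repo_dir, "rev-parse", "--abbrev-ref", "HEAD")
    sha = _git(repo_dir, "rev-parse", "HEAD")
    message = _git(repo_dir, "log", "-1", "--format=%B")
    return branch, sha, message


def plan_commit(git_branch: str, git_sha: str, git_message: str) -> ImportPlan:
    # The git_sha metadata key is the one named in the text above.
    return ImportPlan(branch=git_branch, message=git_message, metadata={"git_sha": git_sha})
```

For example, plan_commit(*read_git_head("./my-dvc-repo")) would describe the commit that an import of the current branch is expected to produce.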

When you want to see what an import will do before committing to it, use --dry-run:

lakectl import-from-dvc ./my-dvc-repo lakefs://my-repo --dry-run

The full set of options:

  • --branch <branch>: Export a specific Git branch instead of the current one; repeat to export several at once (default: the current branch).
  • -r, --remote <name>: Use a specific DVC remote rather than the repo’s default.
  • --dry-run: Preview the import plan without writing anything to lakeFS.
  • --skip-broken-revs: When exporting multiple branches, skip any that fail instead of aborting the whole run.
  • --show-files: Expand directory outputs to list every file instead of a single summary line.

# export two branches in one go
lakectl import-from-dvc ./my-dvc-repo lakefs://my-repo --branch main --branch dev

# use a specific remote
lakectl import-from-dvc ./my-dvc-repo lakefs://my-repo --remote staging
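
If you drive the import from a script or CI job, it can help to assemble the argument list in one place. The sketch below only builds the list and runs nothing: build_import_command is a hypothetical helper, and the base command and flags are copied verbatim from this post rather than verified against the tool.

```python
from typing import Optional, Sequence


def build_import_command(
    repo_dir: str,
    target: str,
    branches: Sequence[str] = (),
    remote: Optional[str] = None,
    dry_run: bool = False,
    skip_broken_revs: bool = False,
    show_files: bool = False,
    base_command: Sequence[str] = ("lakectl", "import-from-dvc"),  # as written in this post
) -> list:
    """Build an argv list for the import; pass it to subprocess.run to execute."""
    cmd = [*base_command, repo_dir, target]
    for branch in branches:
        cmd += ["--branch", branch]      # the flag is repeated once per branch
    if remote:
        cmd += ["--remote", remote]
    if dry_run:
        cmd.append("--dry-run")
    if skip_broken_revs:
        cmd.append("--skip-broken-revs")
    if show_files:
        cmd.append("--show-files")
    return cmd
```

A cautious pipeline might build the command with dry_run=True first, inspect the plan, and only then run it for real.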

Once it finishes, switch to the lakeFS UI and you’ll see the new branch, the commit (with its git_sha in the metadata), and your objects, referenced in place, with nothing re-uploaded.

A couple of things to expect

Not every DVC output can be imported, and the tool reports anything it skips under a “Skipped” heading rather than failing quietly. The main cases are stages that were never run (no hash info), directory outputs with a missing or corrupted cache (a dvc push usually fixes these), outputs marked cache: false or push: false, dvc import / dvc import-url stages, external or out-of-repo paths, and per-output remote: overrides in dvc.yaml.
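
Several of these skip rules can be read straight off an output entry. The sketch below is a rough approximation under stated assumptions: skip_reason is a hypothetical helper, each output is assumed to be a dict with DVC-style keys (path, md5, cache, push, remote), and the missing-cache case is left out because it requires inspecting storage. The tool’s real logic and wording may differ.

```python
from typing import Optional


def skip_reason(out: dict, stage_kind: str = "regular") -> Optional[str]:
    """Return why an output would likely be skipped, or None if it looks importable."""
    if stage_kind in ("import", "import-url"):
        return "dvc import / dvc import-url stage"
    if not out.get("md5"):
        return "no hash info (stage never run)"
    if out.get("cache") is False:
        return "cache: false"
    if out.get("push") is False:
        return "push: false"
    if out.get("remote"):
        return "per-output remote override"
    path = out.get("path", "")
    if "://" in path or path.startswith(("/", "..")):
        return "external or out-of-repo path"
    return None
```

Running it over every output and grouping the non-None results gives a rough preview of what would appear under the “Skipped” heading.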

Two limits worth setting expectations around: only the HEAD of each branch is exported (Git history isn’t replayed) and uncommitted or staged changes are ignored, so commit in Git first. Supported remotes are S3, GCS, and Azure Blob Storage. Imported objects are not automatically garbage collected by lakeFS: to remove them from the object store following an import, you must manually execute dvc gc.

Running DVC and lakeFS side by side

You don’t have to stop using DVC the moment you import. The two tools run in parallel without stepping on each other: the import reads your committed DVC state at a single point in time rather than syncing continuously. The flip side is that anything you add to DVC after an import won’t appear in lakeFS on its own. When you want the newer state in lakeFS, commit it in Git and re-run the import to bring it across.

One thing to watch if you keep both around: the import is zero-copy, so your lakeFS commits point at the very same objects DVC pushed to the remote. Running dvc gc can therefore delete objects that older lakeFS commits still reference, leaving those commits unreadable. This holds for any imported data in lakeFS, whose underlying objects are managed by the source system rather than by lakeFS, so lakeFS garbage collection won’t protect them. Avoid dvc gc against a remote you’ve imported from unless you’re certain no lakeFS commit depends on what it would remove.
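
The safety condition here boils down to a set intersection: a cleanup is only safe if nothing it would delete is still referenced by a lakeFS commit. The sketch below assumes you have already collected both sets of object keys by some other means; objects_at_risk is a hypothetical helper and calls neither DVC nor lakeFS.

```python
from typing import Iterable


def objects_at_risk(referenced_by_lakefs: Iterable[str], removable_by_gc: Iterable[str]) -> list:
    """Return the object keys a cleanup would delete that lakeFS commits still reference."""
    return sorted(set(referenced_by_lakefs) & set(removable_by_gc))


# Hypothetical object keys for illustration only.
at_risk = objects_at_risk(
    referenced_by_lakefs={"store/aa/111", "store/bb/222"},
    removable_by_gc={"store/bb/222", "store/cc/333"},
)
```

An empty result is the only outcome under which running the cleanup leaves every imported commit readable.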

What You Get After Migrating

Once your data is in lakeFS, the same datasets you were already storing pick up a Git-like interface:

  • Zero-copy branches: spin up an isolated environment over production data in seconds, with no duplication, for every experiment or pipeline.
  • Reproducibility: every commit is immutable; pin one and your inputs can’t drift underneath you. The git_sha link ties each data version back to the exact code that produced it.
  • Atomic promotion: merge validated changes into your main branch as a single, all-or-nothing operation.
  • Quality gates: pre-merge hooks stop bad data before it reaches downstream consumers.
  • Scale and integrations: billions of objects if you need them, plus native Spark, Databricks, Airflow, and Iceberg support and filesystem mounts.
  • Advanced options: when you’re ready to scale beyond a single user, you can move to lakeFS Enterprise, which adds what production deployments need: RBAC, SSO, SCIM, audit trails, metadata search, Mount, Iceberg REST Catalog (Enterprise in preview at time of writing), and the option of a fully managed cloud deployment.

And your data never moved. lakeFS is managing the exact objects that were already sitting in your bucket.

Learn More

dvc-to-lakefs is on GitHub. Issues and contributions welcome. To go deeper on lakeFS itself, check out the docs, come say hi in our Slack community, or try lakeFS and run an import against one of your own repos.

We’d love to hear how it goes.
