
Bast AI Use Case: Using DVC as a Data Registry for Explainable, Offline-Ready AI

How Bast AI uses DVC as a data registry for unstructured AI pipelines—versioning PDFs, page images, ontologies, and retrieval context to build an explainable, offline-ready medical assistant with full provenance and auditability.

Jeny De Figueiredo
7 minute read

When people think about DVC, they often picture classic ML workflows: versioning datasets, tracking experiments, storing model artifacts.

But one of the most powerful patterns we’re seeing is using DVC as a data registry for unstructured data pipelines, where the “dataset” isn’t a CSV, it’s a living collection of PDFs, page images, extracted text, embeddings, ontology files, and derived outputs that must remain traceable, reproducible, and auditable over time.

This post is based on my DVC Community Meetup with Beth Rudden (CEO) and Thanh Lam (CTO) of Bast AI, and walks through how Bast AI uses DVC to power an AI medical assistant that can run with limited or no internet connectivity, while maintaining strong guarantees around explainability and provenance.

The Problem: Unstructured AI Needs Trust, Not Just Answers

In regulated or safety-critical environments such as healthcare, emergency response, defense, and finance, AI failures are not just annoying; they are dangerous.

Teams building AI assistants in these domains tend to run into a few hard requirements:

  • Provenance: Where did this answer come from?
  • Reproducibility: Can I reproduce the exact response path from last month?
  • Determinism: For some questions, the answer must be the same every time.
  • Offline operation: This has to work on-device or at the edge.
  • Auditability: If this decision is questioned, we need an audit trail.

If your system relies on a black-box model call, plus a murky retrieval layer with no versioning, you can’t meet these requirements consistently.

The Solution Pattern: DVC as the System of Record for Unstructured Pipelines

Bast AI uses DVC as the system of record for its AI pipeline. Instead of treating DVC as a dataset-versioning tool alone, they treat it as a registry of every artifact that makes an answer trustworthy.

Their pipeline versions:

  • Source PDFs (e.g., handbooks, manuals, internal docs)
  • Page-level images (a screenshot of each page)
  • Extracted page text
  • Structured context objects used for retrieval (stored in OpenSearch, linked back to DVC-tracked artifacts)
  • Ontology / knowledge graph files used for deterministic routing and explainability
  • Derived assets like QA pairs and quiz questions

This structure supports full lineage: answer -> context -> source page image -> source document version -> pipeline run outputs.

That chain is the difference between an unverifiable “the AI said so” and an answer that shows exactly why a given output was produced.
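The lineage chain above can be sketched as a set of linked records. The class and field names here are illustrative assumptions, not Bast AI's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SourceDocument:
    title: str
    edition: str
    dvc_rev: str              # Git revision pinning this document version in DVC

@dataclass(frozen=True)
class PageArtifact:
    document: SourceDocument
    page_number: int
    image_path: str           # DVC-tracked page snapshot, e.g. pages/page_12.png

@dataclass(frozen=True)
class ContextObject:
    page: PageArtifact
    text: str                 # extracted text chunk used for retrieval

@dataclass(frozen=True)
class Answer:
    response: str
    context: ContextObject

def lineage(answer: Answer) -> list[str]:
    """Walk the chain: answer -> context -> source page image -> document version."""
    page = answer.context.page
    doc = page.document
    return [
        f"answer: {answer.response}",
        f"context: {answer.context.text}",
        f"page image: {page.image_path}",
        f"document: {doc.title}, {doc.edition} @ {doc.dvc_rev}",
    ]
```

Because every record points one hop back toward the versioned source, walking the chain from any answer is a pure traversal with no external lookups.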

Example Implementation: A Medic Co-Pilot That Works Offline

A concrete example of this pattern is Bast AI’s Medic co-pilot, built in part from a medical handbook, the Pararescue Medical Operations (PJ Med) Handbook, 8th Edition. Bast refers to this technology as a “CAT”: Conversational AI Technology. A CAT includes Natural Language Understanding, Natural Language Classification, and Natural Language Generation, whereas most bots rely only on Natural Language Generation over some dataset or set of datasets. Bast’s technology can provide a deterministic answer, responding to the same question the same way, every time. Here is how they do it.

Bast AI Medic Co-pilot Architecture with DVC

The Technology Stack

Bast AI’s architecture leverages:

  • DVC: Data version control and registry
  • Containerization: Docker for portability
  • Microservices: Composable, modular components
  • Open source LLMs: For enrichment, not prediction
  • OpenSearch: For versioned context storage
  • Protégé: For ontology visualization

Step 1: Ingest the handbook into a versioned registry

The pipeline starts with a source PDF added to DVC as raw input, with the bytes stored remotely in cloud object storage; in this case, AWS S3 buckets (see Cloud Storage at the bottom of the diagram).
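Conceptually, adding the PDF with `dvc add` records a content hash and size in a small metafile that Git tracks, while the file itself is pushed to the S3 remote. A minimal sketch of that bookkeeping (simplified; real `.dvc` metafiles are YAML and carry more fields):

```python
import hashlib

def register_artifact(path: str, data: bytes) -> dict:
    """Sketch of the bookkeeping behind `dvc add`: record a content hash,
    size, and path, while the bytes themselves go to remote storage."""
    return {
        "outs": [{
            "md5": hashlib.md5(data).hexdigest(),  # DVC's default content hash
            "size": len(data),
            "path": path,
        }]
    }
```

The content hash is what lets DVC deduplicate storage and let Git pin an exact version of a multi-gigabyte file through a few lines of metadata.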

Step 2: Convert PDF into page-level artifacts

A processing stage explodes the PDF into:

  • page_1.png, page_2.png, …
  • page_1.txt, page_2.txt, …
  • metadata linking each artifact back to the original source
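The output of this stage can be modeled as a list of artifact records, each pointing back at its source. The paths and field names are illustrative assumptions, not Bast AI's actual layout:

```python
def page_artifacts(doc_id: str, num_pages: int) -> list[dict]:
    """Enumerate the per-page artifacts the explosion stage emits,
    each carrying metadata that links it back to the source document."""
    return [
        {
            "source": doc_id,              # original DVC-tracked PDF
            "page": n,
            "image": f"pages/page_{n}.png",
            "text": f"pages/page_{n}.txt",
        }
        for n in range(1, num_pages + 1)
    ]
```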

This is the first key design choice: DVC tracks the page images, not just the extracted text.

Why this matters:

  • Page images allow users to verify answers visually.
  • They provide stronger audit evidence than text alone (especially when formatting matters).
  • They make it easier to operate offline: the “proof” ships with the assistant.

Step 3: Create retrieval-ready context, linked to sources

Next, the extracted text is processed into context objects for search and retrieval, which Bast stores in OpenSearch as a vector database. Those retrieval objects aren’t floating around unattached: they’re derived from, and trace back to, versioned artifacts in DVC.
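One way to keep retrieval objects attached is to index every chunk with explicit provenance fields pointing back to the DVC-tracked artifacts. The field names below are an illustrative assumption, not Bast AI's actual index mapping:

```python
def make_context_doc(chunk_text: str, embedding: list[float],
                     page_image: str, dvc_rev: str) -> dict:
    """Build an index document whose provenance fields tie the
    retrieval chunk back to a versioned page image in the DVC registry."""
    return {
        "text": chunk_text,
        "embedding": embedding,           # vector used for similarity search
        "provenance": {
            "page_image": page_image,     # DVC-tracked path to the page snapshot
            "dvc_rev": dvc_rev,           # revision pinning the source version
        },
    }
```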

Step 4: Serve answers with provenance

When a user asks a question, the assistant can return:

  • the generated response
  • the exact source page image
  • bibliographic reference (title/edition/author)
  • the context used to generate the answer
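Putting the four pieces together, a response payload could look like this (field names assumed for illustration):

```python
def serve_answer(generated: str, page_image: str,
                 reference: str, context_text: str) -> dict:
    """Assemble a response that carries its own evidence: the generated
    text plus the exact page snapshot, citation, and context behind it."""
    return {
        "response": generated,
        "source_page_image": page_image,  # exact DVC-tracked page snapshot
        "reference": reference,           # title / edition / author
        "context": context_text,          # retrieval context used
    }
```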

The CAT can “show its work.” That ability depends entirely on having a registry where those assets are versioned and retrievable.


Deterministic + Generative Outputs: Both Benefit from Versioning

In real systems, not all answers are equal. Bast supports two response types: Deterministic responses (must be identical every time), and Generative responses (useful variation, grounded by sources).

The deterministic responses are used for procedures and high-risk guidance. To achieve this, they use an ontology/knowledge graph (via Protégé) and human-in-the-loop, expert-curated mappings, so the system returns the same output every time. DVC tracks the ontology files and versions them like any other artifact. This ensures that if an ontology changes, the resulting change in behavior can be traced back to the correct ontology version.

Generative responses from LLMs are used for enrichment and useful variation. As the underlying sources change, the outputs will change too, and DVC traces every change back to the data.

Why This Works Well With DVC

This use case is a strong fit for DVC because:

Reproducibility across complex pipelines

Unstructured pipelines generate many intermediate files (images, chunks, JSON, embeddings, indexes). DVC turns that chaos into a reproducible, navigable lineage.

Collaboration without overwriting each other

Multiple contributors may touch the same pipeline (ingestion, ontology editing, QA generation). DVC provides a shared registry so teams can coordinate changes and roll back when needed.

“Offline-ready” by design

Shipping an edge assistant means shipping a bundle: sources + derived artifacts + the retrieval context. Versioning makes packaging safer because you know exactly what’s inside the bundle.
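Packaging such a bundle can be reduced to a hashed manifest over the versioned artifacts, so you can verify exactly what is inside after shipping. A minimal sketch (an assumed manifest shape, not DVC's actual export format):

```python
import hashlib

def build_manifest(files: dict[str, bytes]) -> dict[str, str]:
    """Hash every artifact going into the offline bundle so the shipped
    contents are exactly enumerable and verifiable."""
    return {path: hashlib.md5(data).hexdigest() for path, data in files.items()}

def verify_bundle(files: dict[str, bytes], manifest: dict[str, str]) -> bool:
    """Re-hash the unpacked bundle and compare against the manifest."""
    return build_manifest(files) == manifest
```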

Audits become possible

If someone asks, “Why did the assistant recommend step 6?” you can point to:

  • the DVC-tracked source
  • the specific page snapshot
  • the version of the ontology/context at that time

That is a fundamentally different system design than a typical “LLM + RAG” demo stack.

For teams operating at larger scale, the same versioning principles can be applied directly at the object-storage layer as a control plane, particularly when data spans multiple pipelines or teams and benefits from Git-like semantics outside a single repository.

lakeFS is designed for enterprise AI and data engineering teams that require highly scalable data version control infrastructure with petabyte-scale multimodal object stores and data lakes.

The Hard Part: Team Hygiene, Not the Tool

One practical lesson from this pattern: the biggest challenge is often behavioral.

Thanh mentioned that the best practice is for teams to commit frequently and communicate about data changes the same way they do code changes. Otherwise, uncommitted artifacts diverge, and you end up resolving conflicts late. A simple internal rule helps: If you produced data another teammate needs, commit it to DVC, and be sure to use meaningful commit messages describing what changed in the dataset/artifacts.


Key Takeaways

  1. “Take the hard road to easy. Don’t take the easy road to hard.” – Beth Rudden: Do the versioning work upfront to avoid problems later
  2. Sanitize inputs to control outputs: Clean, versioned data leads to reliable results
  3. Use the right tool for the right problem: LLMs for enrichment, logic for prediction
  4. Show your work: Attribution and transparency build trust
  5. Augment, don’t replace: AI should enhance human capabilities, not automate them away

Conclusion

As Beth Rudden puts it: “We’re starting a movement. We want people to understand that you can have fully explainable AI, especially if you use the right tool for the right problem.”

By leveraging DVC for data versioning, you can build AI systems that are:

  • Transparent and explainable
  • Auditable and compliant
  • Reproducible and reliable
  • Collaborative and scalable

You can catch the full meetup below.


For more information about Bast AI’s approach to ethical AI, visit their website or check out Beth Rudden’s book “AI for the Rest of Us.” For comprehensive DVC documentation, visit https://docs.dvc.org
