Comparison to DataLad and git-annex

ilya · September 10, 2018, 5:43pm

What are the main differences between DVC and DataLad/git-annex ( https://www.datalad.org/ , https://git-annex.branchable.com/ )? What would be reasons to use one or the other?

kupruser · September 10, 2018, 7:34pm

Hi @ilya !

There are a few main differences that might make you choose one over the other:

DataLad uses and requires git and git-annex for all of its operations. DVC, generally speaking, doesn’t require any SCM at all for its core features, but it can also be made to work with any SCM of your choice (e.g. Mercurial) by adding a simple driver to dvc/scm.py. Also, git-annex has its own pros and cons, and some people might prefer to avoid using it. In DVC we use our own system to store and transfer data in a variety of ways (e.g. Amazon S3, Google Cloud Storage, Microsoft Azure, SSH, HDFS).
DataLad is more focused on the data itself and provides a convenient way to discover and use datasets created by other people with a single command, while DVC is currently focused on separate projects and doesn’t provide that functionality (though we are working on it).
DataLad uses git to provide reproducibility, while DVC stores every stage of your pipeline in a separate human-readable .dvc file and provides convenient tools to manipulate the DAG (e.g. visualization).
With DVC you can specify external dependencies and external outputs in your pipeline stages and have them automatically tracked and cached without transferring them to your local machine. As far as I know, this is not possible with DataLad, and it might even be impossible to implement because its design relies on git and git-annex.
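
To make the DAG point above a bit more concrete, here is a minimal Python sketch of how a pipeline graph can be derived from per-stage dependencies and outputs. This is only an illustration of the idea, not DVC's actual code or file format: the stage names, file paths, and the helpers build_graph and run_order are all made up for this example, and it assumes Python 3.9+ for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical stages (not real DVC files): each stage lists the files it
# reads (deps) and the files it writes (outs).
STAGES = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "featurize": {"deps": ["data/clean.csv"], "outs": ["data/features.csv"]},
    "train": {"deps": ["data/features.csv"], "outs": ["model.pkl"]},
}


def build_graph(stages):
    """Map each stage to the set of stages whose outputs it depends on."""
    # Index every output file by the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A dep with no producer (e.g. raw input data) adds no edge to the graph.
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def run_order(stages):
    """Return the stages in an order that respects their dependencies."""
    return list(TopologicalSorter(build_graph(stages)).static_order())


if __name__ == "__main__":
    print(run_order(STAGES))
```

Because the edges come from matching one stage's outputs to another stage's dependencies, nothing in this sketch relies on git history; that is the contrast the third point is drawing.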

Thanks,
Ruslan

drorata · June 2, 2020, 5:37am

Almost 2 years later, is there anything else to add to this comparison?

jorgeorpinel · June 4, 2020, 4:03pm

The data versioning layer of DVC is still basically the same, although .dvc files in 1.0 are much simpler and easier to edit manually; e.g. you can add a bunch of different files independently and combine the resulting .dvc files into a single one later.

But otherwise yes! DVC provides many more features now, mainly in pipeline control (also made easier in 1.0 with a single dvc.yaml instead of multiple stage files), performance, and experiment management. For example, params, metrics, and plots (those are all DVC commands).

adina · October 5, 2020, 2:53pm

I hope it’s okay for me to post this here (I’m from the DataLad team). We’ve been asked this question, too, and we have written a procedural comparison between the two tools.

We’ve recreated a workflow from DVC with DataLad here, and we’re showcasing an ML analysis with DataLad in a DataLad-centric way here.
