Comparison to DataLad and git-annex

What are the main differences between DVC and DataLad/git-annex? What would be reasons to use one or the other?


Hi @ilya !

There are a few main differences that might make you choose one over the other:

  1. DataLad uses and requires git and git-annex for all of its operations. DVC, generally speaking, doesn’t require any SCM at all for its core features, but it can also be made to work with any SCM of your choice (e.g. Mercurial) by adding a simple driver to dvc/. Also, git-annex itself has its own pros and cons, and some people might prefer to avoid it. In DVC we use our own system to store and transfer data through a variety of backends (e.g. Amazon S3, Google Cloud Storage, Microsoft Azure, SSH, HDFS).

  2. DataLad is more focused on the data itself and provides a convenient way to discover and use datasets created by other people with a single command, while DVC is currently focused on separate projects and doesn’t provide that functionality (though we are working on it).

  3. DataLad uses git to provide reproducibility, while DVC stores every stage of your pipeline in a separate human-readable .dvc file and provides convenient tools to manipulate the DAG (e.g. visualization).

  4. With DVC you can specify external dependencies and external outputs in your pipeline stages and have them automatically tracked and cached without transferring them to your local machine. As far as I know, this is not possible with DataLad and might even be impossible to implement because its design relies on git and git-annex.
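
To make point 3 a bit more concrete, here is a toy sketch of how a DAG can be derived from per-stage files that each list their dependencies and outputs. This is not DVC's actual implementation. The stage names, file paths, and the `build_dag` helper are all made up for illustration, and it assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each entry stands in for one stage file and lists
# the files the stage reads (deps) and the files it writes (outs).
stages = {
    "prepare.dvc": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train.dvc": {"deps": ["data/clean.csv", "train.py"], "outs": ["model.pkl"]},
    "evaluate.dvc": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
}


def build_dag(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    # Index every output file by the stage that writes it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A dependency nobody produces (e.g. raw data, source code) adds no edge.
    return {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }


dag = build_dag(stages)
# TopologicalSorter expects {node: predecessors}; static_order() yields
# stages so that every stage comes after the stages it depends on.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Because the edges come only from matching outputs to dependencies, the stage files themselves stay plain and human-readable, and the run order falls out of the graph.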



Almost 2 years later, is there anything else to add to this comparison?

The data versioning layer of DVC is still basically the same, although .dvc files in 1.0 are much simpler and easier to edit manually. For example, you can dvc add a bunch of different files independently and combine the resulting .dvc files into a single one later.
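
As a rough illustration of combining separately created .dvc files, here is a small sketch that works on in-memory dicts rather than real YAML files. It assumes a .dvc file is essentially a top-level `outs` list whose entries carry an `md5` and a `path`. The `track` and `combine` helpers are hypothetical and are not part of DVC.

```python
import hashlib


def track(path, content):
    """Build a minimal .dvc-like dict for one file (toy stand-in for tracking it)."""
    return {"outs": [{"md5": hashlib.md5(content).hexdigest(), "path": path}]}


def combine(*dvc_files):
    """Concatenate the `outs` lists of several .dvc-like dicts into one."""
    merged = {"outs": []}
    seen = set()
    for dvc_file in dvc_files:
        for out in dvc_file["outs"]:
            # The same path tracked twice would be ambiguous, so refuse it.
            if out["path"] in seen:
                raise ValueError(f"path tracked twice: {out['path']}")
            seen.add(out["path"])
            merged["outs"].append(dict(out))
    return merged


# Two files tracked independently (contents are placeholder bytes)...
images = track("images.zip", b"fake image bytes")
labels = track("labels.csv", b"id,label\n1,cat\n")
# ...and later folded into a single description.
combined = combine(images, labels)
print([out["path"] for out in combined["outs"]])
```

The point is only that each entry is self-describing (a hash plus a path), so merging is list concatenation. Whether a hand-merged file is accepted as-is depends on the DVC version you run.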

But otherwise yes! DVC provides many more features now, mainly around pipeline control (also made easier in 1.0 with a single dvc.yaml instead of multiple stage files), performance, and experiment management. For example: params, metrics, and plots (those are all DVC commands).

I hope it’s okay for me to post this here (I’m from the DataLad team). We’ve been asked this question too, and we have written a procedural comparison between the two tools.

We’ve recreated a workflow from DVC with DataLad here, and we’re showcasing an ML analysis with DataLad in a DataLad-centric way here.