Comparison to DataLad and git-annex


#1

What are the main differences between DVC and DataLad/git-annex ( https://www.datalad.org/ , https://git-annex.branchable.com/ )? What would be reasons to use one or the other?


#2

Hi @ilya !

There are a few main differences that might make you choose one over the other:

  1. DataLad uses and requires git and git-annex for all of its operations. DVC, generally speaking, doesn’t require any SCM at all for its core features, but it can also be made to work with an SCM of your choice (e.g. Mercurial) by adding a simple driver to dvc/scm.py. Also, git-annex has its own pros and cons, and some people might prefer to avoid it. In DVC we use our own system to store and transfer data, with support for a variety of backends (e.g. Amazon S3, Google Cloud Storage, Microsoft Azure, SSH, HDFS).

  2. DataLad is more focused on the data itself and provides a convenient way to discover and use datasets created by other people with a single command, while DVC is currently focused on separate projects and doesn’t provide that functionality (though we are working on it).

  3. DataLad uses git to provide reproducibility, while DVC stores every stage of your pipeline in a separate human-readable .dvc file and provides convenient tools to manipulate the DAG (e.g. visualization).

  4. With DVC you can specify external dependencies and external outputs in your pipeline stages and have them automatically tracked and cached without transferring them to your local machine. As far as I know, this is not possible with DataLad and might even be impossible to implement because its design relies on git and git-annex.
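
To make point 3 a bit more concrete, here is a rough sketch of how per-stage files with dependencies and outputs add up to a DAG. This is not DVC's actual code or its real .dvc file schema: the `STAGES` dictionary, the file names in it, and the `build_dag` and `run_order` helpers are all made up for illustration, and it assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical stages: each stage file lists the files it reads (deps)
# and the files it writes (outs). Names and layout are illustrative only
# and do not reflect the real .dvc file format.
STAGES = {
    "prepare.dvc": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
    "train.dvc": {"deps": ["clean.csv"], "outs": ["model.pkl"]},
    "evaluate.dvc": {"deps": ["model.pkl", "clean.csv"], "outs": ["metrics.json"]},
}


def build_dag(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    # Index every output file by the stage that writes it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A dependency with no producer (e.g. raw.csv) is a plain source file
    # and adds no edge to the graph.
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def run_order(stages):
    """Return stage names in an order that respects the dependencies."""
    return list(TopologicalSorter(build_dag(stages)).static_order())


if __name__ == "__main__":
    print(run_order(STAGES))
```

Because each stage is a small readable file, the whole graph can be rebuilt from the files alone, which is what makes things like visualization or re-running only the affected stages straightforward.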

Thanks,
Ruslan