DVC compared with Git LFS for storage and versioning only

Hi all - I’m considering using DVC on an existing project that already has other tooling for reproducibility and has access to essentially unlimited Git + LFS storage on GitHub (through a university affiliation). If I don’t want to use DVC’s reproducibility tools right now, is there a use case for DVC on this project? That is, does DVC give me any benefit over Git LFS?

I’ve been looking through the docs, and it sounds like DVC versions whole files (similar to Git LFS), but keeps the cache locally (which I like) and gives me control over the repository’s remote storage (which I also like - I’ve seen weird errors in Git LFS before where a file disappeared, and I’m not sure what happened).

Would those be the main benefits?

Thanks @nickrsan!

More context here https://discordapp.com/channels/485586884165107732/563406153334128681/629825409097138186

The obvious one - GitHub still has a 2 GB file size limit with Git LFS - https://help.github.com/en/articles/about-git-large-file-storage

Second, yes - caching and better, more explicit data management, such as pulling or pushing only part of the data.

Third, advanced data management - DVC can use reflinks, hardlinks, etc. to make checkout very fast; push/pull run in parallel when uploading/downloading data (I’m not sure about LFS); and a shared cache can save space when multiple projects use the same data and/or multiple people use the same machine.
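
To make the shared cache + hardlink idea a bit more concrete, here is a rough sketch in plain Python. This is not DVC's actual code, just an illustration of the general technique: files are stored once in a cache under a name derived from their content hash, and each project gets a hardlink to that cache entry (falling back to a copy where hardlinks are not supported). All function names, the directory layout, and the choice of SHA-256 are my own assumptions for the example.

```python
import hashlib
import os
import shutil
import tempfile
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are never fully loaded into memory."""
    digest = hashlib.sha256()  # hash choice is arbitrary for this sketch
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def add_to_cache(src: Path, cache_dir: Path) -> Path:
    """Store src in the cache under its content hash; identical content is stored once."""
    checksum = file_hash(src)
    cached = cache_dir / checksum[:2] / checksum[2:]  # hypothetical layout
    if not cached.exists():
        cached.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, cached)
    return cached


def checkout(cached: Path, dest: Path) -> None:
    """Place a cached file into a workspace, preferring a hardlink over a copy."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    if dest.exists():
        dest.unlink()
    try:
        os.link(cached, dest)  # no extra disk space used
    except OSError:
        shutil.copy2(cached, dest)  # fallback if the filesystem refuses hardlinks


def demo() -> dict:
    """Two projects share one cache entry for the same data file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "shared_cache"
        source = root / "data.bin"
        source.write_bytes(b"x" * 1024)

        first = add_to_cache(source, cache)
        second = add_to_cache(source, cache)  # same content, same entry

        a = root / "project_a" / "data.bin"
        b = root / "project_b" / "data.bin"
        checkout(first, a)
        checkout(first, b)

        return {
            "same_entry": first == second,
            "cache_files": sum(1 for p in cache.rglob("*") if p.is_file()),
            "content_matches": a.read_bytes() == b.read_bytes() == source.read_bytes(),
            "hardlinked": os.path.samefile(a, b),
        }


if __name__ == "__main__":
    print(demo())
```

Whether the two workspace files end up as real hardlinks depends on the filesystem, which is why the sketch falls back to copying.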

Fourth, features like dvc import / dvc get and the dvc.api Python interface - a really great way to reuse data or model files. One use case is a data registry, for example a GitHub repo that holds all the history. Another, with dvc get and dvc.api, is model deployment.
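
As a rough illustration of the dvc.api part, something like the following could be used to pull a model or dataset out of another DVC repo at a given revision. It is only a sketch: it assumes DVC is installed, and the two helper functions are names I made up for the example. Only `dvc.api.read` and `dvc.api.open` come from DVC itself.

```python
from typing import Optional


def read_tracked_file(path: str, repo: str, rev: Optional[str] = None) -> bytes:
    """Return the raw bytes of a DVC-tracked file from a repo at a given revision."""
    import dvc.api  # imported lazily so this module loads even without DVC installed

    return dvc.api.read(path, repo=repo, rev=rev, mode="rb")


def count_lines(path: str, repo: str, rev: Optional[str] = None) -> int:
    """Stream a DVC-tracked text file and count its lines without loading it whole."""
    import dvc.api

    with dvc.api.open(path, repo=repo, rev=rev) as f:
        return sum(1 for _ in f)
```

For deployment, a service could call `read_tracked_file("model.pkl", "https://github.com/example-org/model-registry", rev="v1.0")` at startup; the repo URL, file path, and tag there are hypothetical placeholders, not a real registry.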

And those differences are only about data management :slight_smile: without even touching pipelines or the metrics part (which I actually also consider part of the data management layer).
