DVC compared with Git LFS for storage and versioning only

Hi all - I’m considering using DVC on an existing project that already has other functionality for reproducibility and has access to essentially unlimited Git + LFS on GitHub (affiliated with a university). If I don’t want to use DVC’s reproducibility tools right now, is there a use case for DVC on this project? That is, does DVC provide me a benefit over Git LFS?

I’ve been looking through the docs, and it sounds like DVC versions the whole file (similar to Git LFS), but keeps the cache locally (which I like) and gives me control over the remote storage of the repository (which I also like; I’ve seen weird errors before in Git LFS where a file disappeared, and I’m not sure what happened).

Would those be the main benefits?


Thanks @nickrsan!

More context here https://discordapp.com/channels/485586884165107732/563406153334128681/629825409097138186

The obvious one: GitHub still has a 2 GB file-size limit with Git LFS - https://help.github.com/en/articles/about-git-large-file-storage

Second, yes - caching and better, more explicit data management, such as pushing/pulling data partially.

Third, advanced data management: DVC utilizes reflinks, hardlinks, etc. to make checkouts very fast; push/pull use parallelism to upload/download data (I’m not sure about LFS); and there is the ability to use a shared cache to save space if multiple projects use the same data and/or multiple people use the same machine.

Fourth, features like dvc import / dvc get and the dvc.api Python interface - a really great way to reuse data or model files. Use cases include a data registry, for example via a GitHub repo with all of the history, etc.
Or, with dvc get and dvc.api, model deployment.
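For illustration, a minimal sketch of the data-registry idea using dvc.api, assuming a hypothetical registry repo at https://github.com/example/data-registry with a models/model.pkl artifact tagged v1.0:

```python
# Hedged sketch (not from the thread): reusing a model file from a
# hypothetical data-registry repo via the dvc.api Python interface.
import dvc.api

# Stream a versioned artifact straight from the registry's DVC remote.
# The path, repo URL, and rev below are placeholders.
with dvc.api.open(
    "models/model.pkl",
    repo="https://github.com/example/data-registry",
    rev="v1.0",  # any Git revision: a tag, branch, or commit
    mode="rb",
) as f:
    model_bytes = f.read()

# Or just resolve the artifact's location on remote storage without downloading it.
url = dvc.api.get_url(
    "models/model.pkl",
    repo="https://github.com/example/data-registry",
    rev="v1.0",
)
print(url)
```

dvc get does the same kind of thing from the command line: it downloads the file without turning the current project into a DVC repo.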

And those differences are only about data management :slight_smile: not touching the pipelines and metrics parts (which I actually also consider part of the data management layer).


@shcheklein I think Git LFS can only push data to Bitbucket servers. I am not sure if files tracked by Git LFS can be pushed to GitHub. Can you clarify?

@aman GitHub definitely supports Git LFS - GitHub even built the initial implementation (https://git-lfs.github.com/). Other providers, such as GitLab, support Git LFS as well.


Does DVC support a Python API to push/pull data? The command line is useful when I want to add files explicitly.

I am looking for a solution that can integrate with a pipeline tool (e.g. Kedro). I need to programmatically push my model and config for every single run.

Running the command line every time is not a viable option.


@nok probably yes - please see the Python API Reference.

But please open a separate question for this, or reach us at dvc to elaborate on this question if this short answer isn’t useful. We may need more information to understand your use case.

Thanks

Does DVC support a Python API to push/pull data? The command line is useful when I want to add files explicitly.

Yes. It’s not documented yet, but there is a Repo class that implements the same functionality the CLI provides (the CLI is built on top of that API). And if that is not enough, since DVC is written in Python, you can even use some DVC internals to manipulate data (we don’t guarantee stability, though).
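A minimal sketch of what that could look like, with the caveat above about stability (the paths here are placeholders):

```python
# Hedged sketch: DVC's internal Repo class, which backs the CLI. It is
# undocumented and not guaranteed to be stable, so names may differ
# between DVC releases. Paths are placeholders.
from dvc.repo import Repo

repo = Repo(".")              # open the DVC project in the current directory
repo.add("models/model.pkl")  # roughly `dvc add models/model.pkl`
repo.push()                   # roughly `dvc push` (upload to the default remote)
repo.pull()                   # roughly `dvc pull`
```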

I am looking for a solution that can integrate with a pipeline tool (e.g. Kedro). I need to programmatically push my model and config for every single run.

Do you create Git commits, though? Usually, if you have a lot of runs, that might become a problem…

Also, could you describe your setup/use case? I wonder if you could use CML and DVC Viewer to solve some of your problems.

I am not using DVC heavily; I only experimented with it previously as a data versioning solution, and I am not using it right now.

I have a few questions.

  1. The data versioning feature seems to be coupled with DVC pipelines. I already have my own pipeline tool (e.g. Kedro / Airflow). The input dependencies are already tracked by those DAG libraries, but DVC seems to rely on its own DAG. Is there any use case where people use another DAG library while using DVC?

  2. One of the problems I have is that I have 2 GPUs on my machine, so a lot of the time I am running more than one experiment. DVC doesn’t fit nicely with this situation (I just learned from you that there is a beta feature for parallel runs; I haven’t explored it yet).
    For my case, I just create a new folder with a timestamp for storing the artifacts of a particular run (data processing / model training), but DVC tracks a particular path, so it doesn’t work well in multi-run situations.

I am not sure what CML or DVC Viewer does.

I do like DVC handling the caching for me. The downside of having a separate folder for artifacts (my current approach) is duplicate files; if the data artifacts are the same, ideally it should just point to a reference instead of writing a new file.

I found an open issue about DVC with Airflow. I guess more examples of how to use DVC with other pipeline tools would help clear my head.

The data versioning feature seems to be coupled with DVC pipelines. I already have my own pipeline tool (e.g. Kedro / Airflow).

@nok, many of our users do use DVC just for data versioning. I am not that familiar with Kedro/Airflow, but you could either run dvc add in your scripts, use our Python API, or use something integrated into those tools (a BashOperator, perhaps?).
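For example, a hedged sketch of an Airflow DAG (assuming Airflow 2.x; the DAG id, schedule, and paths here are all made up) that trains a model and then versions it with DVC via a BashOperator:

```python
# Hedged sketch: versioning a pipeline artifact with DVC from an Airflow DAG.
# Assumes Airflow 2.x; DAG id, schedule, and paths are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="train_and_version_model",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    train = BashOperator(
        task_id="train",
        bash_command="cd /opt/project && python train.py",  # produces models/model.pkl
    )
    version = BashOperator(
        task_id="dvc_add_and_push",
        # Track the new artifact and upload it to the DVC remote.
        bash_command="cd /opt/project && dvc add models/model.pkl && dvc push",
    )
    train >> version
```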

Is there any use case where people use another DAG library while using DVC?

DVC can still be used to get the appropriate data (using dvc get / dvc checkout in scripts, or dvc.api in Python code) or to save it. That is, you can use DVC only for versioning and use a different tool for pipelines; though you have to put that into your scripts/pipeline steps yourself, it’s not automatic/integrated, for sure.
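As an illustration, a hedged sketch of a pipeline step that reads a DVC-tracked input via dvc.api and re-versions its output by shelling out to the DVC CLI (the function name and paths are hypothetical):

```python
# Hedged sketch: using DVC only for versioning inside a step of some other
# pipeline tool (Kedro, Airflow, ...). Function name and paths are made up.
import subprocess

import dvc.api


def featurize() -> None:
    # Read the currently checked-out, DVC-tracked input from the local repo.
    with dvc.api.open("data/raw/train.csv") as f:
        header = f.readline()
        rows = f.readlines()

    # ... transform `rows` into features here ...
    with open("data/features/train.csv", "w") as out:
        out.write(header)
        out.writelines(rows)

    # Version the new output and upload it to the DVC remote.
    subprocess.run(["dvc", "add", "data/features/train.csv"], check=True)
    subprocess.run(["dvc", "push"], check=True)
```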

I am running more than one experiment; DVC doesn’t fit nicely with this situation

As you mentioned, experiments are in a beta state, though you could start using them today (and that would help us shape the feature with your feedback). I don’t have much to add here, as you seem to be aware of this, unless I got the question wrong?

I do like DVC handling the caching for me. The downside of having a separate folder for artifacts (my current approach) is duplicate files; if the data artifacts are the same, ideally it should just point to a reference instead of writing a new file.

Just to clarify: you mean .dvc files having entries for outputs wherever they’ve been checked out?
We could do that, but it might be too difficult to keep the whole repo in sync (the other entry might later point to a different version of the output, which means the entry would have to be adjusted elsewhere).

@nok, many of our users do use DVC just for data versioning. I am not that familiar with Kedro/Airflow, but you could either run dvc add in your scripts, use our Python API, or use something integrated into those tools (a BashOperator, perhaps?).

I guess I will just have to map the different combinations of steps with dvc run and all of their corresponding inputs/outputs,
i.e.
step 1: kedro run --nodes=step1
step 2: kedro run --nodes=step2
steps 1 + 2: kedro run --nodes="step1,step2"

For the parallel experiments, how is the output file handled? I.e., if test.csv is written at the end of each run, which files does my local file system keep? And is the test.csv from both runs pushed to DVC automatically?

@nok could you please create a separate question/thread for the experiments? It’s a bit off-topic for this thread.