Defer loading data when cloning repository?

We currently use a single repository for data science related projects (mainly due to limitations of GitHub private repositories). When cloning that repository, all Git LFS files are downloaded, which takes quite a while. Is there a way that using DVC would allow us to defer loading data from some of our data-science sub-projects until we are ready to work on the notebooks?

Hi @brylie!

Yes, DVC makes this manageable.

  1. dvc pull and dvc push sync only the latest version of the data files on the current branch. So, if your projects are separated by branches, it will work just fine.
  2. If you keep all the projects in the same branch (like master), you can sync a subset of the data files by specifying a DVC file: dvc pull pmap_project.dvc (see the sketch after this list).
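A minimal sketch of that per-project pull, using the pmap_project.dvc file from the example above as a stand-in for whichever sub-project you want to work on:

```console
# Pull only the data tracked by one sub-project's DVC file
dvc pull pmap_project.dvc

# Or pull everything tracked on the current branch
dvc pull
```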

Is there an approach that would work when cloning the repository?

Hi @brylie!

When you clone a git repository that uses dvc, dvc doesn’t download any data until you explicitly tell it to by running the dvc pull command. So basically, cloning would look like:

  1. git clone /path/to/remote - no different with or without dvc; downloads only code;
  2. dvc pull or dvc pull mydata.dvc - downloads the data (see the example below);
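For example, a fresh checkout of one sub-project could look like this (the repository URL and the mydata.dvc target are placeholders, not real names):

```console
# Clone downloads only what git stores; no DVC-managed data is fetched yet
git clone https://github.com/example/data-science-monorepo.git
cd data-science-monorepo

# Later, when you are ready to work on a specific sub-project's notebooks,
# fetch just the data tracked by that project's DVC file
dvc pull mydata.dvc
```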

Thanks,
Ruslan
