Defer loading data when cloning repository?

We currently use a single repository for data science related projects (mainly due to limitations of GitHub private repositories). When cloning that repository, all Git LFS files are downloaded, which takes quite a while. Is there a way that using DVC would allow us to defer loading data from some of our data-science sub-projects until we are ready to work on the notebooks?

Hi @brylie!

Yes, DVC makes this manageable.

  1. dvc pull and dvc push sync only the latest version of the data files on the current branch. So, if your projects are separated by branches, it will work just fine.
  2. If you keep all the projects in the same branch (like master), you can sync a subset of the data files by specifying a DVC file: dvc pull pmap_project.dvc (see the sketch after this list).
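A minimal sketch of that per-project pull, using the pmap_project.dvc file from the example above as a stand-in for whichever sub-project you want to work on:

```console
# Pull only the data tracked by one sub-project's DVC file
dvc pull pmap_project.dvc

# Or pull everything tracked on the current branch
dvc pull
```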

Is there an approach that would work when cloning the repository?

Hi @brylie!

When you clone a git repository that uses dvc, dvc doesn’t download any data until you explicitly tell it to by running the dvc pull command. So basically, cloning would look like:

  1. git clone /path/to/remote - no different with or without dvc; downloads only code;
  2. dvc pull or dvc pull mydata.dvc - downloads the data (see the example below);
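For example, a fresh checkout of one sub-project could look like this (the repository URL and the mydata.dvc target are placeholders, not real names):

```console
# Clone downloads only what git stores; no DVC-managed data is fetched yet
git clone https://github.com/example/data-science-monorepo.git
cd data-science-monorepo

# Later, when you are ready to work on a specific sub-project's notebooks,
# fetch just the data tracked by that project's DVC file
dvc pull mydata.dvc
```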

Thanks,
Ruslan
