Multiple overlapping datasets

kazimpal · December 7, 2020, 2:31pm

Hi,

I’m considering using DVC for a project and am wondering if it fits the following use case. Say I have N images which form a dataset, but for any given experiment I may only want to train on some subset of the images. What I would like is to be able to determine which images in the subset are missing locally and pull only those images from the file store. Ideally these subsets could also have names, and I’d be able to just run dvc pull datasubset3. Is something like this possible?

Thanks

jorgeorpinel · December 7, 2020, 6:59pm

Hello,

OK so let’s assume that dataset is a directory which is tracked by DVC.

Yes: once you have determined which files you need, you can pull those specifically. It has to be done one by one, though (you can print the entire list to a file and then use a shell script to run dvc pull on each name). This is what we call “granularity support” in most of our commands, including push and pull.
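As a rough illustration of the list-then-pull approach, here is a minimal Python sketch in place of a shell script. It assumes a hypothetical text file (for example subset.txt) with one path per line, relative to the repository root, and that each path lies inside a DVC-tracked directory. The function names are made up for this example and are not part of DVC.

```python
import subprocess
from pathlib import Path
from typing import List


def missing_files(subset_list: Path, repo_root: Path) -> List[str]:
    """Return the paths listed in subset_list that do not exist under repo_root."""
    names = [line.strip() for line in subset_list.read_text().splitlines()]
    # Skip blank lines; keep only entries that are absent from the workspace.
    return [n for n in names if n and not (repo_root / n).exists()]


def pull_commands(paths: List[str]) -> List[List[str]]:
    """Build one `dvc pull <path>` command per missing file."""
    return [["dvc", "pull", p] for p in paths]


def pull_subset(subset_list: Path, repo_root: Path = Path(".")) -> None:
    """Run `dvc pull` once for each listed file that is missing locally."""
    for cmd in pull_commands(missing_files(subset_list, repo_root)):
        # check=True stops at the first failing pull instead of continuing silently.
        subprocess.run(cmd, cwd=repo_root, check=True)
```

Calling pull_subset(Path("subset.txt")) from the repository root would then fetch only the listed files that are not already present. Keeping one list file per subset is one way to approximate named subsets.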

DVC doesn’t have a built-in feature for named subsets like that, but we’re open to feature requests on GitHub (iterative/dvc).

Note that we are also about to release a wildcard feature called “globbing”, which you can see in pull request #5032 in iterative/dvc (“pull: add glob option” by ju0gri). Depending on your dataset naming and file structure, maybe that will be enough for your case?
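To judge whether a wildcard would cover a given subset, it can help to test a pattern against your file names first. The sketch below uses Python's standard fnmatch module with made-up file names. It only illustrates shell-style pattern matching in general; the exact matching rules of DVC's glob option may differ.

```python
from fnmatch import fnmatch
from typing import List


def select_by_pattern(names: List[str], pattern: str) -> List[str]:
    """Return the names that match a shell-style wildcard pattern."""
    return [n for n in names if fnmatch(n, pattern)]


# Hypothetical dataset layout, for illustration only.
files = [
    "dataset/cats/001.jpg",
    "dataset/cats/002.jpg",
    "dataset/dogs/001.jpg",
]

subset = select_by_pattern(files, "dataset/cats/*.jpg")
print(subset)
```

If a subset cannot be described by one or a few such patterns (for example, an arbitrary selection of images from a flat directory), an explicit list of file names is probably the better fit.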

Best
