Multiple overlapping datasets

Hi,

I’m considering using DVC for a project and am wondering if it fits the following use case. Say I have N images which form a dataset, but for any given experiment I may only want to train on some subset of the images. I would like to be able to determine which images in the subset are missing locally and pull only those images from the file store. Ideally these subsets could also have names, so that I could just run dvc pull datasubset3. Is something like this possible?

Thanks

Hello,

OK, so let’s assume that “dataset” is a directory tracked by DVC.

Yes: once you have determined which files you need, you can pull those specifically. But it has to be done one by one (you can print the entire list to a file and then use a shell script to dvc pull each name). This is what we call “granularity support”, which most of our commands have, including push and pull.
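The list-then-pull approach described above could be sketched roughly as follows, in Python rather than shell. This is only an illustration under some assumptions: the tracked directory is called dataset, each subset is a plain text file with one file name per line (the name datasubset3.txt is made up), and the function names are hypothetical helpers, not part of DVC. The only DVC call used is dvc pull with a single file path as its target.

```python
import subprocess
from pathlib import Path


def missing_files(names, dataset_dir="dataset"):
    """Return the paths under dataset_dir that are listed in names but absent locally."""
    root = Path(dataset_dir)
    return [root / name for name in names if not (root / name).exists()]


def build_pull_commands(paths):
    """Build one `dvc pull <target>` command per missing path."""
    return [["dvc", "pull", p.as_posix()] for p in paths]


def pull_subset(list_file, dataset_dir="dataset"):
    """Pull every file named in list_file (hypothetical format: one name per line)
    that is not already present in dataset_dir."""
    lines = Path(list_file).read_text().splitlines()
    names = [ln.strip() for ln in lines if ln.strip()]
    for cmd in build_pull_commands(missing_files(names, dataset_dir)):
        # check=True stops at the first failed pull instead of continuing silently.
        subprocess.run(cmd, check=True)
```

With a list file per subset, calling pull_subset("datasubset3.txt") from the repository root would then fetch only the files that are not already in the workspace. Note that the local check here is just a file-existence test; it does not verify that an existing file matches the tracked version.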

DVC doesn’t have a built-in feature for named subsets like that, but we’re open to feature requests in the iterative/dvc repository on GitHub.

Note that we are also about to release a wildcard feature called “globbing”, which you can see in pull request #5032 in iterative/dvc (“pull: add glob option” by ju0gri). Maybe that will be enough for your case, depending on your dataset naming/file structure?
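To get a feel for whether wildcard matching would cover a given naming scheme, a pattern can be tried against the file list first. The sketch below uses Python's standard fnmatch module purely as an illustration; it is not DVC's implementation, DVC's matching rules may differ, and the file names and pattern are made up.

```python
from fnmatch import fnmatchcase


def select_by_pattern(names, pattern):
    """Return the names that match a shell-style wildcard pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Hypothetical file names: the subset is encoded in the directory prefix.
files = ["cats/001.jpg", "cats/002.jpg", "dogs/001.jpg"]
print(select_by_pattern(files, "cats/*.jpg"))  # ['cats/001.jpg', 'cats/002.jpg']
```

If the subsets overlap and cannot be expressed through names or directories like this, explicit file lists are probably the more practical route.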

Best