Hello everyone,
I am sorry to bump this topic, but I think it fits the use case my coworker and I have been trying to implement for days, so it seems better to have a unified topic on the subject. More specifically, I believe that DVC could benefit from a system dedicated to experiments on large unstructured datasets, which are defined by the fact that the training and test datasets are only modified by the addition and removal of files, rather than by edits to a small number of individual text files (a CSV, for instance).
The use case is as follows: we want to track a dataset of roughly 20,000 images stored on an S3 remote (in other situations, the number of images would be far greater).
We would like to be able to follow this protocol:
- create a tag or a branch once the initial data import is finished;
- try a first experiment;
- remove or add images before creating a new tag or branch;
- try a new experiment with the new set of images.
The data would be in a data registry while the experiments are conducted from the workspace of a project repo. Ideally, DVC would quickly restore the images or add the new ones between experiments.
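For what it's worth, the protocol above can already be approximated with today's DVC commands. A rough sketch, assuming the dataset lives in `data/` with an S3 remote already configured (the pain point being that every add/remove re-hashes the whole directory):

```shell
# Initial import: track the directory and tag the first version
dvc add data
git add data.dvc .gitignore
git commit -m "dataset v1"
git tag dataset-v1
dvc push                      # upload to the S3 remote

# Add or remove images, then record a new version
cp new_images/*.jpg data/
dvc add data                  # re-hashes the whole directory
git commit -am "dataset v2"
git tag dataset-v2
dvc push

# Restore an older version between experiments
git checkout dataset-v1 -- data.dvc
dvc checkout data.dvc
```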
While reading this topic, something clicked about our use case: the Catalog concept. We tried to do exactly what @VladKol wanted to implement, i.e. adding or removing symbolic links rather than the images themselves, and I understand that this is impossible due to the nature of DVC files.
However, what would be perfect for this case is a dedicated object tracked by DVC (a Catalog object, maybe?). The idea is that this folder would be populated by mostly immutable files: deletions would be possible but infrequent, and would mainly amount to garbage collection. There would then be commands to create DVC files that link to specific images in the catalog (a Catalog Subset) without any data duplication.

Data scientists could use a command to modify the Catalog Subset by adding or removing images contained in the Catalog. Each version of the Subset could be saved as a DVC file, which would enable quick changes to the train and test datasets between experiments, for instance by reverting to an older Subset containing fewer or more images. Checking out a Subset would reconstruct, within the workspace, the direct links to the selected images. The perks would be no data duplication (files could be streamed directly from S3) while maintaining versioning capabilities.
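To make the Subset idea concrete, its DVC file might look something like this. This is a purely hypothetical sketch loosely modeled on today's `.dvc` files; none of these fields exist in DVC:

```yaml
# subset_train_v2.dvc -- hypothetical, not a real DVC format
catalog: catalog_name          # the Catalog this subset draws from
entries:                       # references into the catalog, no data copied
- path: images/cat_00001.jpg
  md5: d41d8cd98f00b204e9800998ecf8427e
- path: images/dog_00042.jpg
  md5: 0cc175b9c0f1b6a831c399e269772661
```

Since entries are only hash references, switching between Subset versions would just relink files already in the cache.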
An idea of the syntax could be:
```
dvc create catalog --name catalog_name --path /path/to/catalog/folder

# Different options could be possible for the subset:
dvc catalog subset subset_name --from-catalog catalog_name --re <regex to filter images>
dvc catalog subset subset_name --from-catalog catalog_name --rand 0.3  # a number between 0 and 1 to select a sample of images from the catalog
dvc catalog subset subset_name --from-catalog catalog_name --list <a list of images from the catalog>
dvc catalog subset subset_name --from-catalog catalog_name --ui-select  # launches a hypothetical UI to select images in the catalog
```
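In the meantime, the `--re` and `--rand` selection modes can be emulated on a plain-text manifest of catalog paths. A minimal sketch with standard tools (GNU `sort -R` provides the reproducible sample; the file names are made up):

```shell
# Build a toy manifest: one catalog path per line
printf 'images/cat_001.jpg\nimages/cat_002.jpg\nimages/dog_001.jpg\nimages/dog_002.jpg\n' > manifest.txt

# --re equivalent: keep only paths matching a regex
grep -E 'cat_' manifest.txt > subset_re.txt

# --rand 0.5 equivalent: seeded (deterministic) shuffle, then keep the first half
total=$(wc -l < manifest.txt)
keep=$(( total / 2 ))
sort -R --random-source=/dev/zero manifest.txt | head -n "$keep" > subset_rand.txt

wc -l < subset_re.txt     # 2 matching paths
wc -l < subset_rand.txt   # 2 sampled paths
```

Each resulting subset file could then be committed alongside the experiment that used it, which is essentially what the proposed `Catalog Subset` object would formalize.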
Now, what would be even better to complement this use case (though I am aware it is far more complicated) would be a command such as `dvc catalog view`, which would open an image viewer to look at the images in the catalog, and a `dvc catalog subset view` command that would do the same for a subset.
What are your opinions on this use case? Would it be compatible with DVC's concepts? I believe that many data scientists working on image datasets would be very interested in such a feature, as they currently lack useful tools for this situation.