I really like the idea of having Data Registries to maintain and re-use datasets. I’m working on an open-source project where we want to apply this concept, though not necessarily for machine learning. We have a collection of images that we want to use in different projects. The images are Chinese radicals and drawings related to those radicals. We call our “Data Registry” a “Media Library”.
To find out whether DVC could fit our requirements, we have developed a proof of concept. The idea is to have a library that holds the DVC pointers and a website that uses the images from the library. In the future, we could have other projects using the images, including ML ones.
We want to manage the library through pull requests. Media contributors can add or delete images. Certain repository events trigger workflows that perform tasks such as generating versions of the images in different sizes and formats, interacting with the DVC storage, etc.
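To make the workflow step a bit more concrete, here is a rough Python sketch of how a job could work out which image variants to generate and which DVC commands to run afterwards. Everything specific in it is an assumption made for illustration: the function names `plan_variants` and `dvc_commands`, the widths, the formats, and the `derived/` folder are hypothetical and not necessarily what the proof of concept does. Only `dvc add` and `dvc push` are real DVC commands. The actual resizing is left out.

```python
from pathlib import PurePosixPath

# Hypothetical settings: the widths, formats and output folder are
# illustrative assumptions, not values from the proof of concept.
WIDTHS = (128, 512)
FORMATS = ("png", "webp")


def plan_variants(source, widths=WIDTHS, formats=FORMATS, out_dir="derived"):
    """Return the output path of every size/format variant of one source image."""
    stem = PurePosixPath(source).stem
    return [
        str(PurePosixPath(out_dir) / str(width) / f"{stem}.{fmt}")
        for width in widths
        for fmt in formats
    ]


def dvc_commands(paths):
    """Build the DVC CLI calls a workflow step could run once the files exist."""
    # `dvc add` starts tracking the files; `dvc push` uploads them to the remote.
    return [["dvc", "add", *paths], ["dvc", "push"]]


if __name__ == "__main__":
    variants = plan_variants("radicals/water.svg")
    for path in variants:
        print(path)
    for command in dvc_commands(variants):
        print(" ".join(command))
```

A workflow could pass each command list to `subprocess.run(command, check=True)` after writing the resized files, so that a failed `dvc push` fails the job.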
You can read a full explanation of the CI/CD processes here:
I would like to hear about other use cases like this one, where DVC is used primarily to implement a versioning system for image collections in a fully automated way.