Managing Labels/Annotations

qwirdoti · December 5, 2024, 7:41am

Hi DVC Community

I would be interested in how others generally manage labels/annotations.

Assume I have one large dataset managed by DVC, which is further processed in multiple ML projects. The dataset consists of documents, where each file can have multiple labels and sublabels. Two files may have the same label but different sublabels. Hence, a hierarchical folder structure is not suitable for my use case. Labels and sublabels may change over time, but the content of the files does not change.

A possible solution I have in mind is to keep all files versioned within a single folder and keep track of the labels in a separate CSV. The CSV tracks each filename and its corresponding labels. But this would require additional effort to keep the data and the CSV in sync when files are added, deleted, or updated.
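To make the sync problem concrete, here is a minimal sketch of a check that compares the folder with the CSV. It assumes a flat data folder and a CSV with a `filename` column and one row per file/label/sublabel combination; the function name `check_sync` and the column name are hypothetical, chosen only for illustration.

```python
import csv
from pathlib import Path


def check_sync(data_dir, labels_csv, filename_column="filename"):
    """Compare files on disk with the rows of the label CSV.

    Returns (unlabeled, orphaned): files that have no CSV row, and
    filenames in the CSV whose file no longer exists, as sorted lists.
    The column name is an assumption; adjust it to the real CSV layout.
    """
    # Names of all regular files directly inside the data folder.
    on_disk = {p.name for p in Path(data_dir).iterdir() if p.is_file()}

    # A file may appear in several rows (one per label/sublabel),
    # so collect the filenames into a set.
    with open(labels_csv, newline="", encoding="utf-8") as f:
        in_csv = {row[filename_column] for row in csv.DictReader(f)}

    return sorted(on_disk - in_csv), sorted(in_csv - on_disk)
```

Something like this could run before each commit or as an early pipeline step, failing when either list is non-empty, so that the data and the CSV cannot drift apart unnoticed.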

Are there similar use cases or possible alternative solutions to this?
Thanks in advance for any advice.

shcheklein · December 10, 2024, 6:26pm

Hi @qwirdoti! A question: do you originally store the files (documents) in the cloud (and then move them to DVC)?

qwirdoti · December 11, 2024, 7:10am

Yes, that is currently our case. Essentially, we pull documents and labels from LabelStudio (cloud buckets) into ML projects.

shcheklein · December 23, 2024, 6:55pm

I think you should look into DataChain. Its whole purpose is to connect metadata (labels) and documents (references to documents in the cloud) into a “dataset” that can be versioned. Take a look at this example repository on GitHub, shcheklein/example-datachain-dvc (an example of how to use DataChain and DVC to version data, make a project reproducible, and track experiments and models), and this Google Colab tutorial for reading JSON labels.

Please share some details (JSON / CSV structure, whether you are using YOLO or not, etc.) and I can help you set up the pipeline / project.

qwirdoti · January 3, 2025, 8:05am

Thanks for the suggestion @shcheklein. In the meantime, we opted for a combination of DVC and fiftyone, which appears to suit our use case very well.
