Managing Labels/Annotations

Hi DVC Community

I would be interested in how others manage labels/annotations.

Assume I have one large dataset managed by DVC, which is further processed in multiple ML projects. The dataset consists of documents, where each file can have multiple labels and sublabels. Two files may have the same label but different sublabels, so a hierarchical folder structure is not suitable for my use case. Labels and sublabels may change over time, but the content of the files does not change.

A possible solution that I have in mind would be to have all files versioned within a single folder and keep track of the labels in a separate CSV. The CSV would track each filename and its corresponding labels. But this would require additional effort to keep the data and the CSV in sync when files are added, deleted, or updated.
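To make the idea concrete, here is a minimal sketch of the kind of sync check I have in mind. It assumes a "long" CSV with one row per (file, label, sublabel) combination, stored outside the data folder. The column names `filename`, `label`, and `sublabel` and the function names are placeholders I made up for illustration, not anything DVC prescribes.

```python
import csv
from collections import defaultdict
from pathlib import Path


def load_labels(labels_csv):
    """Map each filename to its set of (label, sublabel) pairs.

    Assumes hypothetical columns: filename, label, sublabel.
    """
    labels = defaultdict(set)
    with open(labels_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            labels[row["filename"]].add((row["label"], row["sublabel"]))
    return dict(labels)


def check_sync(data_dir, labels_csv):
    """Compare the files in data_dir with the filenames in the CSV.

    Returns (unlabeled, orphaned):
      unlabeled: files on disk that have no row in the CSV
      orphaned:  filenames in the CSV that have no file on disk
    """
    on_disk = {p.name for p in Path(data_dir).iterdir() if p.is_file()}
    in_csv = set(load_labels(labels_csv))
    return sorted(on_disk - in_csv), sorted(in_csv - on_disk)
```

Something like `check_sync("data", "labels.csv")` could run as a pre-commit hook or an early pipeline stage and fail whenever either list is non-empty. Because the file contents never change, only additions and deletions need to be detected, so comparing filenames should be enough.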

Are there similar use cases or possible alternative solutions to this?
Thanks in advance for any advice.


Hi @qwirdoti! Question: do you originally store the files (documents) in the cloud (and then move them to DVC)?

Yes, that is currently our case. Essentially, we pull documents and labels from LabelStudio (cloud buckets) into ML projects.

I think you should look into DataChain. Its whole purpose is to connect metadata (labels) and documents (references to documents in the cloud) into a “dataset” that can be versioned. Take a look at this example on GitHub: shcheklein/example-datachain-dvc (an example of how to use DataChain and DVC to version data, make a project reproducible, and track experiments and models), and at this Google Colab tutorial for reading JSON labels.

Please share some details (the JSON/CSV structure, whether or not you are using YOLO, etc.) and I can help you set up the pipeline/project.

Thanks for the suggestion @shcheklein. In the meantime, we opted for a combination of DVC and fiftyone, which appears to suit our use case very well.
