I have a reasonably large dataset (a few TB) of mostly audio files, for use in a variety of machine learning pipelines.
From time to time, we will have labelers correct the structure of the data by simply moving, renaming, or deleting files (sometimes creating or moving folders too).
I’d like to have a reproducible pipeline that, starting from a known “unlabeled” dataset, can apply the same “labels” (i.e., moves, renames, deletes) to a large folder of data.
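To make the idea concrete, here is a minimal sketch of what replaying such “labels” could look like, assuming the labelers' changes have been captured as an ordered list of operations. The `apply_ops` function and the `op`/`src`/`dst` record format are hypothetical names for illustration, not something DVC or Git-LFS provides.

```python
import shutil
from pathlib import Path


def apply_ops(root, ops):
    """Replay an ordered list of label operations against a dataset folder.

    Each op is a dict in a hypothetical format:
      {"op": "move", "src": "<relative path>", "dst": "<relative path>"}
      {"op": "delete", "src": "<relative path>"}
    A rename is just a move within the same folder.
    """
    root = Path(root)
    for op in ops:
        src = root / op["src"]
        if op["op"] == "move":
            dst = root / op["dst"]
            # Create destination folders on demand, since labelers may add new ones.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
        elif op["op"] == "delete":
            if src.is_dir():
                shutil.rmtree(src)
            else:
                src.unlink()
        else:
            raise ValueError(f"unknown operation: {op['op']!r}")
```

Starting from the same “unlabeled” copy and replaying the same list in the same order should give the same folder layout each time, which is the reproducibility property I am after.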
Is DVC a good tool for this? Or should I be using Git-LFS (or something else)?
I couldn’t find any examples or tutorials in the docs for this use case, so perhaps it’s not the right tool. Thanks!
Where do you store your data?
Would it be possible to decouple your metadata from your data? That is, instead of using file names / directories as the source of labels, store them in a file, or create a JSON file for each object, etc.
Then it will be much easier to manipulate your data and version it. You won’t even have to put the raw data into DVC, which can be painful for large amounts of data; you would version only a file with file names + metadata.
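As a rough illustration of the “file with file names + metadata” idea, the sketch below builds a small manifest that maps each file's content hash to its relative path, then compares two manifests to recover the moves and deletes between them. The function names and the manifest layout are assumptions made for this example; this is not DVC's or DataChain's format. It also assumes every file has unique content, and hashing several TB of audio would take a while.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large audio files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map content hash -> relative path for every file under root.

    Assumption: no two files share identical content (duplicates would collide).
    """
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            manifest[file_digest(path)] = path.relative_to(root).as_posix()
    return manifest


def diff_manifests(before, after):
    """List the moves/renames and deletes that turn `before` into `after`."""
    ops = []
    for digest, old_path in sorted(before.items(), key=lambda kv: kv[1]):
        new_path = after.get(digest)
        if new_path is None:
            ops.append({"op": "delete", "src": old_path})
        elif new_path != old_path:
            ops.append({"op": "move", "src": old_path, "dst": new_path})
    return ops
```

With something like this, only the small manifest (for example dumped to JSON) would need to be versioned, while the raw audio stays wherever it is stored.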
This approach is what our sister tool DataChain takes, btw. If you are interested we can talk more about this - let me know.
Thanks for the response! Today we store the data in Dropbox’s enterprise version (~15TB), but the data is growing and it’s not a great solution. We will likely take the cost hit and move to S3, and when we need to do large training runs, we’ll copy the final training data (~200GB) to Lambda Labs.
It’s possible to decouple, but not without a tradeoff. Today we create a .json metadata file for each audio file, and then a separate .json file at each root folder level for each small set of audio files.
However, the folders and filenames (and their hierarchy) in each root folder have human meaning. Labelers are extremely quick at syncing a few hundred root folders, previewing audio, renaming and deleting files in a traditional file explorer, and syncing them back up afterwards. So if we divorce the labels/metadata from the filesystem, we’d need to code our own labeling interface to show the data to labelers and let them edit it, or repurpose some existing labeling UI code. Unless I’m misunderstanding your recommendation?
I don’t know anything about DataChain.
I think I have come to the conclusion that a metadata object (file) per folder group + a labeling UI is probably the right choice.
So now I am thinking through how to best let labelers view assets on S3 and update these metadata JSON files via a web browser.
Curious if there are any great tools around DVC (or otherwise) for this use case. Some issues in particular are transcoding audio files to reduce S3 out-of-region data transfer costs, preventing data corruption when labelers edit concurrently, and the labeling UI itself.
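On the concurrent-labelers point, one common pattern I am considering is optimistic concurrency: each metadata object carries a version number, and a write is accepted only if the labeler's copy was based on the current version. The sketch below uses a hypothetical in-memory `MetadataStore` as a stand-in for the real backend; it is not an S3 or DVC API, and a real store would need its own atomic compare-and-write mechanism.

```python
import copy
import threading


class ConflictError(Exception):
    """Raised when a write is based on a stale version of the metadata."""


class MetadataStore:
    """Hypothetical in-memory stand-in for a versioned metadata backend."""

    def __init__(self):
        self._lock = threading.Lock()
        self._items = {}  # key -> (version, data)

    def get(self, key):
        """Return (version, data); missing keys read as version 0, empty dict."""
        with self._lock:
            version, data = self._items.get(key, (0, {}))
            return version, copy.deepcopy(data)

    def put(self, key, data, expected_version):
        """Write data only if nobody else has written since expected_version."""
        with self._lock:
            current_version, _ = self._items.get(key, (0, {}))
            if current_version != expected_version:
                raise ConflictError(
                    f"{key}: expected version {expected_version}, "
                    f"found {current_version}"
                )
            new_version = current_version + 1
            self._items[key] = (new_version, copy.deepcopy(data))
            return new_version
```

If two labelers load the same folder's metadata and both save, the second save raises `ConflictError` rather than silently overwriting the first, and the UI can then reload and ask the labeler to reapply the change.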