Implementing simple snapshots of file moves, renames, deletes

audiofile · September 21, 2024, 8:06pm

I have a reasonably large dataset (few TB) of mostly audio files, for use in a variety of machine learning pipelines.

From time to time, we will have labelers correct the structure of the data by simply moving, renaming, or deleting files (sometimes creating or moving folders too).

I’d like to have a reproducible pipeline that, starting from a known “unlabeled” dataset, can apply the same “labels” (ie: moves, renames, deletes) to a large folder of data.
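To make the idea concrete, here is a minimal sketch of what "replaying labels" could look like: an ordered list of operations, stored as plain data, applied to a fresh copy of the unlabeled dataset. The record layout (`op`, `src`, `dst`) and the function name `apply_ops` are assumptions made up for illustration; they are not part of DVC or Git-LFS.

```python
import shutil
from pathlib import Path


def apply_ops(root, ops):
    """Replay an ordered list of label operations against a dataset root.

    Each op is a dict (hypothetical layout):
      {"op": "move", "src": "a/x.wav", "dst": "b/y.wav"}  # also covers renames
      {"op": "delete", "src": "a/z.wav"}                   # file or folder
    Paths are relative to `root`.
    """
    root = Path(root)
    for op in ops:
        src = root / op["src"]
        if op["op"] == "move":
            dst = root / op["dst"]
            # Create the destination folder if the labeler made a new one.
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
        elif op["op"] == "delete":
            if src.is_dir():
                shutil.rmtree(src)
            else:
                src.unlink()
        else:
            raise ValueError(f"unknown op: {op['op']!r}")
```

Because the operations are order-dependent, the log would need to be replayed from the same starting snapshot every time to get the same result.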

Is DVC a good tool for this? Or should I be using Git-LFS (or something else)?

I couldn’t find any examples or tutorials in the docs for this use case, so perhaps it’s not the right tool. Thanks!

shcheklein · September 21, 2024, 8:56pm

Where do you store your data?
Would it be possible to decouple your metadata from your data? That is, instead of using file names / directories as the source of labels, store them in a file, or create a JSON file per object, etc.

Then, it will be way easier to manipulate your data and version it. You won’t even have to put the raw data into DVC (which can be painful for large amounts of data); you would only version a file with file names + metadata.
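As a rough illustration of the "file with file names + metadata" idea, the sketch below keeps labels in a small JSON manifest while the raw audio files never move. The field names (`file`, `label`, `name`, `deleted`) and the helper functions are hypothetical, chosen only for this example.

```python
import json
from pathlib import Path


def save_manifest(path, entries):
    """Write entries in a stable order so version-control diffs stay small."""
    rows = sorted(entries, key=lambda e: e["file"])
    Path(path).write_text(json.dumps(rows, indent=2, sort_keys=True) + "\n")


def load_manifest(path):
    return json.loads(Path(path).read_text())


def logical_view(entries):
    """Map each non-deleted raw file to the path a labeler assigned to it."""
    return {
        e["file"]: f'{e["label"]}/{e["name"]}'
        for e in entries
        if not e.get("deleted", False)
    }
```

With this layout, a "rename" or "move" is an edit to `name` or `label`, and a "delete" is a flag flip, so each labeling pass becomes an ordinary text diff on one small file.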

This approach is what our sister tool DataChain takes, btw. If you are interested we can talk more about this - let me know.

audiofile · September 23, 2024, 5:06pm

Thanks for the response! Today we store data in Dropbox’s enterprise version (~15TB), but data is expanding and it’s not a great solution. We will likely take the cost hit and move to S3, and when we need to do large training runs, we’ll copy the final training data (~200GB) to Lambda labs.

It’s possible to decouple, but not without tradeoffs. Today we create a .json metadata file for each audio file, and then a separate .json file at each root folder level for each small set of audio files.

However, the folders and filenames (and their hierarchy) in each root folder have human meaning, and labelers are extremely quick at syncing a few hundred root folders, previewing audio, renaming and deleting files in a traditional file explorer, and syncing them back up afterwards. So if we divorce the labels/metadata from the filesystem, we’d need to code our own labeling interface to show the data to labelers and let them edit it, or repurpose some existing labeling UI code. Unless I’m misunderstanding your recommendation?

I don’t know anything about DataChain.

audiofile · October 14, 2024, 10:42pm

I think I have come to the conclusion that a metadata object (file) per folder group + labeling UI is probably the right choice.

So now I am thinking through how to best let labelers view assets on S3 and update these metadata JSON files via a web browser.

Curious if there are any great tools around DVC (or otherwise) for this use case. Some issues in particular are transcoding audio files to reduce S3 out-of-region data transfer costs, preventing data corruption when labelers edit concurrently, and the labeling UI itself.
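For the concurrent-labelers concern, one common pattern is optimistic concurrency: each metadata document carries a version, and a write is rejected if the version changed since the labeler read it. The sketch below is an in-memory stand-in only; `MetadataStore` and `VersionConflict` are hypothetical names, and a real backend would need its own compare-and-swap mechanism (for example a version column or a conditional write), which is an assumption here.

```python
class VersionConflict(Exception):
    """Raised when a document changed between a labeler's read and write."""


class MetadataStore:
    def __init__(self):
        self._docs = {}  # key -> (version, document)

    def read(self, key):
        # Missing documents start at version 0 with an empty body.
        version, doc = self._docs.get(key, (0, {}))
        return version, dict(doc)

    def write(self, key, expected_version, doc):
        current, _ = self._docs.get(key, (0, {}))
        if current != expected_version:
            raise VersionConflict(
                f"{key}: expected v{expected_version}, found v{current}"
            )
        self._docs[key] = (current + 1, dict(doc))
        return current + 1
```

On a conflict, the UI would re-read the document, reapply or merge the labeler's change, and retry, so no edit silently overwrites another.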
