Temporary large datasets in DVC

Hi, I’m working with audio data and use DVC pipelines to track our projects.

Most of our projects have a data preprocessing stage (e.g. re-sampling and slicing the audio files) followed by a training stage for the actual training.

The raw data is tracked in DVC and backed up in the Cloud (~1 TB); the preprocessed data varies in size but can also reach 200–400 GB. Given the size of the preprocessed data, and the fact that I might experiment with different preprocessing hyper-parameters, I haven’t tracked the preprocessed data with DVC yet, either by not including it in the “dvc.yaml” file or by setting “cache: false” and “persist: true” (to at least get the hash codes for the data).
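For context, the relevant part of our “dvc.yaml” currently looks roughly like this (script names, paths and the hyper-parameter are placeholders, not our actual setup):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py --sample-rate 16000   # placeholder command/params
    deps:
      - preprocess.py
      - data/raw                # raw audio, tracked in DVC (~1 TB)
    outs:
      - data/preprocessed:      # 200-400 GB, varies with preprocessing params
          cache: false          # don't copy it into the DVC cache
          persist: true         # keep it between runs; dvc.lock still records its hash
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/preprocessed
    outs:
      - models/model.pt
```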

This wasn’t optimal, but it worked fine on my small local training server. I’m now looking into options to scale up training with Cloud compute. I definitely don’t want to download all the raw data and re-run the preprocessing in the Cloud each time, but I also don’t want to store every preprocessed dataset permanently in our Cloud storage.

Do you have any recommendations for this problem? Are there ways to temporarily track preprocessed datasets with DVC in the Cloud and remove them once they are no longer needed? Can I maybe mark the preprocessed dataset in a way that it always gets deleted when I run “dvc gc --cloud”?

Thanks for any help or any resources you could point me to.
Best,
Ramon

You can try using two different remotes (a remote: name can be set per output) and push the preprocessed data to that second remote. From time to time you can just clean that data out, for example. wdyt?
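Roughly something like this (the remote name and URL are just examples):

```bash
# add a second remote just for the preprocessed data
dvc remote add scratch s3://my-bucket/dvc-scratch
```

```yaml
# dvc.yaml: route only this output to the "scratch" remote
outs:
  - data/preprocessed:
      remote: scratch
```

Note that DVC only pushes cached outputs, so you’d drop the “cache: false” for that output. “dvc push” then uploads data/preprocessed to the scratch remote and everything else to your default remote, and cleaning up is just emptying that bucket/prefix whenever you want.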

Hey, thanks for the fast reply. Yes, that sounds reasonable. I was thinking that might even be an advantage: if I use something like LambdaLabs, I could push the pre-processed dataset into their Cloud storage via SSH. That way the data would be directly available to the Cloud instance, which would probably save some time and cost.
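If I sketch it out, it would probably be something like this (instance address, key file and paths are made up):

```bash
# needs the SSH extras: pip install "dvc[ssh]"
dvc remote add lambda-scratch ssh://ubuntu@<instance-ip>/data/dvc-scratch
dvc remote modify lambda-scratch keyfile ~/.ssh/lambda_key

# push only the preprocessed output to that remote
dvc push --remote lambda-scratch data/preprocessed
```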

Will see how far I get. Best,
Ramon
