Temporary large datasets in DVC

Hi, I’m working with audio data and use DVC pipelines to track our projects.

Most of our projects have a data preprocessing stage (e.g. re-sampling and slicing the audio files) followed by a training stage for the actual training.

The raw data is tracked in DVC and backed up in the Cloud (~1 TB); the preprocessed data varies in size but can also reach 200–400 GB. Given the size of the preprocessed data, and the fact that I might experiment with different preprocessing hyper-parameters, I haven’t tracked the preprocessed data with DVC yet, either by not including it in the “dvc.yaml” file or by setting “cache: false” and “persist: true” (to at least get the hash codes for the data).
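For context, the relevant part of our “dvc.yaml” currently looks roughly like this (script names, paths and the hyper-parameter are placeholders, not our actual setup):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py --sample-rate 16000   # placeholder command/params
    deps:
      - preprocess.py
      - data/raw                # raw audio, tracked in DVC (~1 TB)
    outs:
      - data/preprocessed:      # 200-400 GB, varies with preprocessing params
          cache: false          # don't copy it into the DVC cache
          persist: true         # keep it between runs; dvc.lock still records its hash
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/preprocessed
    outs:
      - models/model.pt
```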

This wasn’t optimal, but it worked fine on my small local training server. I’m now looking into options to scale up training with Cloud compute. I definitely don’t want to download all the raw data and re-run the preprocessing in the Cloud each time, but I also don’t want to store every preprocessed dataset permanently in our Cloud storage.

Do you have any recommendations for this problem? Are there ways to temporarily track preprocessed datasets with DVC in the Cloud and remove them once they are no longer needed? Can I maybe mark the preprocessed dataset in a way that it always gets deleted when I run “dvc gc --cloud”?

Thanks for any help or any resources you could point me to.
Best,
Ramon

You can try using two different remotes (a remote: name can be set per output) and push the preprocessed data to that second remote. From time to time you can just clean that data out, for example. wdyt?
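Roughly something like this (the remote name and URL are just examples):

```bash
# add a second remote just for the preprocessed data
dvc remote add scratch s3://my-bucket/dvc-scratch
```

```yaml
# dvc.yaml: route only this output to the "scratch" remote
outs:
  - data/preprocessed:
      remote: scratch
```

Note that DVC only pushes cached outputs, so you’d drop the “cache: false” for that output. “dvc push” then uploads data/preprocessed to the scratch remote and everything else to your default remote, and cleaning up is just emptying that bucket/prefix whenever you want.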

Hey, thanks for the fast reply. Yes, that sounds reasonable. I was thinking that might even be an advantage: if I use something like LambdaLabs, I could push the pre-processed dataset into their Cloud storage via SSH. That way the data would be directly available to the Cloud instance, which would probably save some time and cost.
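If I sketch it out, it would probably be something like this (instance address, key file and paths are made up):

```bash
# needs the SSH extras: pip install "dvc[ssh]"
dvc remote add lambda-scratch ssh://ubuntu@<instance-ip>/data/dvc-scratch
dvc remote modify lambda-scratch keyfile ~/.ssh/lambda_key

# push only the preprocessed output to that remote
dvc push --remote lambda-scratch data/preprocessed
```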

Will see how far I get. Best,
Ramon
