Hi, I’m working with audio data and use DVC pipelines to track our projects.
Most of our projects have a data preprocessing stage (e.g. re-sampling and slicing the audio files) followed by a training stage.
The raw data is tracked with DVC and backed up in the cloud (~1 TB); the preprocessed data varies in size but can also be 200-400 GB. Given the size of the preprocessed data, and the fact that I might experiment with different preprocessing hyper-parameters, I haven’t tracked the preprocessed data with DVC yet: either I leave it out of the “dvc.yaml” file entirely, or I set “cache: false” and “persist: true” on the output (to at least get the hash codes for the data).
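For context, here is roughly what one of these pipelines looks like in my dvc.yaml (stage names, script names, and paths are simplified placeholders, not the actual project layout):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py          # hypothetical re-sampling / slicing script
    deps:
      - preprocess.py
      - data/raw                       # ~1 TB raw audio, tracked by DVC
    outs:
      - data/preprocessed:             # 200-400 GB, currently not pushed to the remote
          cache: false                 # don't copy it into the DVC cache
          persist: true                # don't delete it before re-running the stage
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/preprocessed
    outs:
      - models/model.pt
```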
This wasn’t optimal, but it worked fine on my small local training server. I’m now looking into options to scale up training with cloud compute. I definitely don’t want to download all the raw data and re-run the preprocessing in the cloud each time, but I also don’t want to store every preprocessed dataset permanently in our cloud storage.
Do you have any recommendations for this problem? Is there a way to temporarily track preprocessed datasets with DVC in the cloud and remove them once they are no longer needed? Can I maybe mark the preprocessed dataset so that it always gets deleted when I run “dvc gc --cloud”?
Thanks for any help or any resources you could point me to.
Best,
Ramon