In the cluster I am using, my home directory at ‘~’ is limited to 200 GB and 160K inodes. My dataset contains far more files than this allows, so I am currently storing it in a separate directory the cluster provides, ‘/scratch’, which exists to hold large amounts of non-backed-up data.
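For concreteness, the kind of setup I'm hoping is possible looks roughly like this (the project and dataset paths are placeholders, and I'm not at all sure these are the right commands or that `dvc add` works through a symlink):

```
cd ~/my-project                          # placeholder: DVC project under the small home quota
dvc cache dir /scratch/$USER/dvc-cache   # keep DVC's cache on the large scratch filesystem
dvc config cache.type symlink            # link files into the workspace instead of copying them
ln -s /scratch/$USER/dataset data/raw    # expose the scratch dataset inside the repo (placeholder paths)
dvc add data/raw                         # track it, then `dvc push` to the remote
```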
Is there any way to deal with scenarios like this in DVC? Essentially, I'm wondering whether I can keep the files in a directory under ‘/scratch’ but still use DVC to add, push, and pull the data to and from a remote, all from a DVC project in my home directory. Thanks for any help you can provide!