Dealing with large datasets and file quotas

On the cluster I am using, my home directory at ‘~’ is limited to 200 GB and 160K inodes, and my dataset contains more files than this quota allows. For that reason, I am currently storing the dataset in a separate directory the cluster provides, ‘/scratch’, which exists to offer large storage space for non-backed-up data.

Is there any way to handle scenarios like this with DVC? Essentially, I’m wondering whether I can keep the files in a directory under ‘/scratch’ but still use DVC to add the data and push/pull it to/from a remote, with the DVC project itself living in my home directory. Thanks for any help you can provide!

You can set the DVC cache directory to that scratch dir and have DVC link files from there into your workspace. See config for an example, and see Large Dataset Optimization for considerations about the different link types.
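In case it helps, here is a minimal sketch of what that setup might look like from the command line; the ‘/scratch’ path below is just a placeholder for whatever your cluster uses, and the link-type order is one reasonable choice rather than the only option:

```bash
# Point the project's cache at the large, non-backed-up scratch space
# (example path -- substitute your own scratch location).
dvc cache dir /scratch/<username>/dvc-cache

# Prefer links over copies so the workspace in ~ is not filled with
# duplicate data; DVC tries the listed link types in order.
dvc config cache.type "reflink,symlink,hardlink,copy"

# Track the dataset as usual; the file contents land in the scratch
# cache and are linked back into the workspace.
dvc add data/

# Pushing to and pulling from your DVC remote works the same as before.
dvc push
dvc pull
```

With a link-based cache.type, the workspace holds links rather than full copies, so the bulk of the data stays under ‘/scratch’ instead of counting against the home-directory quota.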


Thanks, this looks to be exactly what I am looking for!