I think my question has two parts: first, is using external data (as described here) the best approach in my case, and second, if so, why is it so slow to add files?
The context: my ML code and git repo live on my local machine. I develop this codebase in PyCharm and use PyCharm's remote SSH deployment to run my code on a remote machine where the data lives. The dataset is a large imaging dataset, >200 GB across thousands of files, and will not fit on my local machine. When I run my code, PyCharm copies it to the remote machine to run it, but it does not copy the git files, so there is no git repo on the remote machine.
As my git repo and data live on different machines, adding the external dataset to DVC over SSH seemed like a good way to go about things. However, the `dvc add` command is taking a long time to run. In fact, when I tried it on a 6 GB subset of my data, it errored out:
```
ERROR: too many open files, please visit <https://error.dvc.org/many-files> to see how to handle this problem
```
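Following that link, the suggested workaround seems to be raising the per-process open-file limit before re-running `dvc add`. This is what I'd try on the machine doing the hashing (4096 is just an example value, not something I've confirmed fixes it):

```shell
# check the current soft limit on open file descriptors
ulimit -n

# raise the soft limit for this shell session only
# (the hard limit, shown by `ulimit -Hn`, caps what you can set here)
ulimit -n 4096
```
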
So my questions are:
- Is this a sensible way to be using DVC?
- If so, how can I speed things up? Everything seems to be downloaded to my local machine, even though I have set up a remote cache over SSH. Is there any way to have the hashes computed on the remote machine instead?
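For reference, this is roughly how I set things up, following the external-data workflow with an SSH cache (the host and paths below are placeholders, not my real ones):

```shell
# point DVC's cache for ssh:// data at a directory on the remote machine
dvc remote add sshcache ssh://user@remote.example.com/home/user/dvc-cache
dvc config cache.ssh sshcache

# then track the dataset in place on the remote machine
dvc add ssh://user@remote.example.com/data/imaging-dataset
```
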
Thanks for your help,