DVC with external data is very slow

Hi there,

My question has two parts: first, is using external data (as described here) the best approach in my case and, if so, why is adding files so slow?

The context: my ML code and git repo live on my local machine. I develop this codebase in PyCharm and use PyCharm’s remote SSH deployment to run my code on a remote machine where the data lives. The dataset is a large imaging dataset (>200GB, thousands of files) and will not fit on my local machine. When I run my code, PyCharm copies it to the remote machine to run it, but it does not copy the git files, so there is no git repo on the remote machine.

As my git repo and data live on different machines, adding the external dataset to DVC over SSH seemed like a good way to go about things. However, the dvc add command is taking a long time to run. In fact, when I tried to run it on a 6GB subset of my data, it errored out:

ERROR: too many open files, please visit <https://error.dvc.org/many-files> to see how to handle this problem
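For context, this error usually means the process hit the operating system's limit on open file descriptors. Here is a minimal sketch for inspecting that limit on a Unix-like machine using Python's standard `resource` module. The helper names `open_file_limits` and `raise_soft_limit` are hypothetical, and changing the limit this way only affects the current Python process and its children, so it is an illustration rather than a fix for DVC itself (the linked page is the place to look for that).

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft limit to `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Hypothetical helper:
    it only changes the limit for this process and its children.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Nothing to do if the soft limit is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"soft limit: {soft}, hard limit: {hard}")
```

A low soft limit (1024 is a common default) would be consistent with the error appearing even on a small subset made up of many files.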

So my questions are:

  1. Is this a sensible way to be using DVC?
  2. If so, how can I speed things up? It seems like everything is getting downloaded to my local machine, even though I have set up a remote cache over ssh. Is there any way I can get it to compute the hashes on the remote machine?
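To make concrete what I mean by computing the hashes on the remote machine, here is a minimal sketch. It is not DVC's implementation: it assumes MD5 content hashes, the helpers `file_md5` and `hash_tree` are hypothetical, and it would be run directly on the machine that holds the data so that no file contents cross the network.

```python
import hashlib
import os


def file_md5(path, chunk_size=1024 * 1024):
    """Hash one file in fixed-size chunks so large images never sit fully in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_tree(root):
    """Map each file's path (relative to `root`) to its MD5 hex digest.

    Files are opened one at a time, so only one descriptor is held at once.
    """
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            hashes[os.path.relpath(full_path, root)] = file_md5(full_path)
    return hashes
```

Calling `hash_tree("/path/to/data")` on the remote would read every byte locally and send back only the short digests, which is the behaviour I was hoping for.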

Thanks for your help,
Mark

Hi Mark,

Using external data in DVC is usually reserved for advanced scenarios where no other setup will work. In your case, it sounds like the main reason you need it is that the primary code lives on your local machine. Would it be possible to keep the code on your remote machine instead and use PyCharm on your local machine, as described in the PyCharm Remote development documentation?

Best,
Dave

Hey Mark, what’s the exact command you’re using? How many CPU cores does the remote server have?

Hi, the command was

dvc add --external ssh://desktop/path/to/data

and my remote machine has 16 cores.

@dberenbaum Thanks very much for the suggestion - I wasn’t aware I could do the remote deployment that way round. I’ll give it a go.