DVC with external data is very slow

mark · March 15, 2022, 8:20pm

Hi there,

I think my question has two parts: firstly, is the use of external data (as described here) the best way in my case and, if so, why is it so slow to add files?

The context: my ML code + git repo live on my local machine. I develop this codebase using pycharm, and use pycharm’s remote ssh deployment to run my code on a remote machine where the data lives. The dataset is a large imaging dataset, >200GB with 1000s of files, and will not fit on my local machine. When I run my code, pycharm copies it to the remote machine to run it, but it does not copy the git files so there is no git repo on my remote machine.

As my git repo and data live on different machines, it seemed like adding the external dataset to DVC using ssh would be a good way to go about things. However the dvc add command is taking a long time to run. In fact I tried to run it on a 6GB subset of my data and it errored out:

ERROR: too many open files, please visit <https://error.dvc.org/many-files> to see how to handle this problem

So my questions are:

Is this a sensible way to be using DVC?
If so, how can I speed things up? It seems like everything is getting downloaded to my local machine, even though I have set up a remote cache over ssh. Is there any way I can get it to compute the hashes on the remote machine?

Thanks for your help,
Mark

dberenbaum · March 15, 2022, 9:34pm

Hi Mark,

Using external data in DVC is usually for advanced scenarios where no other setup will work. In your case, it sounds like the main reason you need this is because the primary code lives on your local machine. Would it be possible for you to instead have the code on your remote machine and to use pycharm on your local machine as described in Remote development | PyCharm?

Best,
Dave

dtrifiro · March 17, 2022, 10:21am

Hey Mark, what’s the exact command you’re using? How many cpu cores does the remote server have?

mark · March 17, 2022, 4:11pm

Hi, the command was

dvc add --external ssh://desktop/path/to/data

and my remote machine has 16 cores

@dberenbaum Thanks very much for the suggestion - I wasn’t aware I could do the remote deployment that way round. I’ll give it a go.

Topic		Replies	Views
Dvc get error: Unable to find DVC file Questions	12	3247	June 20, 2021
Dvc_api.get_url is not working with external data Questions	10	1024	June 28, 2022
Timing to create a dvc repo for a 60GB dataset? Questions	15	950	March 21, 2022
Dvc external output add after changing files data in remote is failing Questions	2	821	April 19, 2021
Dvc add and push after adding a couple of images Questions	3	632	November 27, 2023

DVC with external data is very slow

Related topics