Large dataset, dvc pull/add/push jobs options

dsa934 · February 3, 2023, 2:36am

Hello, DVC users

I am a newbie who wants to use DVC to manage a large dataset.
To understand DVC, I ran the following experiment.

## normal transfer case 

# In DVC_Main directory 
git init
dvc init
dvc remote add -d <storage_name>  <local_data_storage_url>

dvc add mesh_dataset   # mesh_dataset (40GB)
dvc push 

# In DVC_Sub directory 
git init
dvc init

# copied from the DVC_Main directory: .dvc/config and mesh_dataset.dvc
dvc pull

The steps above test whether data tracked in the main directory can be pulled from the sub directory.
It works, but it took too long (about 80 minutes).

So I looked through the options and found the -j option for dvc add/pull/push.

## parallel transfer case 

# In DVC_Main directory 
git init
dvc init
dvc remote add -d <storage_name>  <local_data_storage_url>
dvc remote modify <storage_name> jobs 64

dvc add -j 64 --to-remote mesh_dataset   # mesh_dataset (40GB)

# In DVC_Sub directory 
git init
dvc init

# copied from the DVC_Main directory: .dvc/config and mesh_dataset.dvc
dvc pull -j 64

The parallel transfer should be faster than the normal transfer, but I don’t understand why there is no speed difference.

In fact, the parallel transfer was slightly faster, but only because the separate ‘dvc push’ step was skipped in the DVC_Main directory (dvc add --to-remote sends the data straight to the remote).

According to the explanation above, I have 1 CPU (8 cores, 14 logical cores), so the default value = 32.

Can you explain why there is no difference in transfer speed between jobs option 32 and 64?

Or does dvc not support parallel transmission of large data?

ronan · February 7, 2023, 12:50pm

Hi @dsa934 !
dvc does push and pull data in parallel, whether you use the -j option or not. However, parallelisation has diminishing returns due to resource constraints (network bandwidth, unparallelisable per-job overhead, …). From your testing, I guess that 32 is already past the threshold where adding more jobs doesn’t bring any speed-up.
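
One way to find where the returns start to diminish is to time the same pull with several -j values. The following is only a rough sketch: `bench_jobs`, `PULL_CMD` and `DATA_DIR` are hypothetical names made up for this example, and it assumes the cache lives in the default `.dvc/cache` location. It deletes the local cache and the workspace copy before each run, so it should only be run in a throwaway clone such as DVC_Sub.

```shell
# Hypothetical helper names; run only in a throwaway clone such as DVC_Sub.
DATA_DIR=${DATA_DIR:-mesh_dataset}
PULL_CMD=${PULL_CMD:-dvc pull}

bench_jobs() {
    for jobs in "$@"; do
        # Assumption: the cache is in the default .dvc/cache location.
        # Remove it and the workspace copy so each run transfers everything again.
        rm -rf .dvc/cache "$DATA_DIR"
        start=$(date +%s)
        # PULL_CMD is left unquoted on purpose so "dvc pull" splits into two words.
        $PULL_CMD -j "$jobs" >/dev/null
        end=$(date +%s)
        echo "jobs=$jobs seconds=$((end - start))"
    done
}

# Example: bench_jobs 1 4 16 32 64
```

If the timings flatten out at a low job count, the bottleneck is probably the storage or network rather than the number of jobs.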

dsa934 · February 7, 2023, 2:57pm

Hi @ronan !

If I understand you correctly, 32 is the threshold here (because 1 CPU has 8 cores; formula: 4 * cpu_count()). Does that mean we need to increase the number of actual CPUs to improve speed?

However, when I experimented with smaller numbers, such as 12 or 16 instead of 64, there was no change in speed.
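
The 4 * cpu_count() default mentioned above can be checked on a given machine with a couple of lines. This is only a sketch: it assumes the default is derived from the logical CPU count, and `getconf _NPROCESSORS_ONLN` may report a different number than Python's cpu_count() under CPU affinity or container limits. The variable names are made up for this example.

```shell
# Logical CPUs currently online, as reported by the OS.
cpus=$(getconf _NPROCESSORS_ONLN)

# Assumed default for -j, following the 4 * cpu_count() formula quoted in this thread.
default_jobs=$((4 * cpus))

echo "logical CPUs: $cpus, assumed default jobs: $default_jobs"
```

If the printed value is already at or above the -j values being compared, those runs are unlikely to differ because of the job count alone.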
