Hello, DVC users,
I am a newbie who wants to use DVC for large dataset management.
To understand DVC, I ran the following experiment.
## normal transfer case
```
# In the DVC_Main directory
git init
dvc init
dvc remote add -d <storage_name> <local_data_storage_url>
dvc add mesh_dataset    # mesh_dataset (40GB)
dvc push

# In the DVC_Sub directory
git init
dvc init
# copied .dvc/config and mesh_dataset.dvc over from the DVC_Main directory
dvc pull
```
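For reference, after `dvc remote add -d`, the `.dvc/config` I copied into DVC_Sub looks roughly like this (the remote name and URL are placeholders for my actual values):

```
[core]
    remote = <storage_name>
['remote "<storage_name>"']
    url = <local_data_storage_url>
```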
The above process tests whether data added in the main directory can be loaded from the sub directory. It works, but it took too long (about 80 minutes).
So I looked for options and found the `-j` (`--jobs`) option for `dvc add`/`push`/`pull`.
## parallel transfer case
```
# In the DVC_Main directory
git init
dvc init
dvc remote add -d <storage_name> <local_data_storage_url>
dvc remote modify <storage_name> jobs 64
dvc add -j 64 --to-remote mesh_dataset    # mesh_dataset (40GB)

# In the DVC_Sub directory
git init
dvc init
# copied .dvc/config and mesh_dataset.dvc over from the DVC_Main directory
dvc pull -j 64
```
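For what it's worth, here is a minimal sketch of how the two cases can be timed (assuming the shell `time` builtin; the workspace copy and local cache are wiped first so that `dvc pull` actually transfers data instead of restoring from cache):

```
# wipe the workspace copy and the local cache so the pull really transfers
rm -rf mesh_dataset .dvc/cache
time dvc pull -j 32

rm -rf mesh_dataset .dvc/cache
time dvc pull -j 64
```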
The parallel transfer should be faster than the normal transfer, but I don't understand why there is no speed difference.
In fact, the parallel run was slightly faster, but only because the separate `dvc push` step was skipped in the DVC_Main directory (`--to-remote` transfers the data while adding it).
According to the explanation above, I have one CPU (8 cores, 14 logical cores), so the default jobs value comes out to 34.
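In case it matters, this is how I checked the core counts on my machine (a Linux sketch; `nproc` prints the logical CPU count that I assume the default is derived from):

```
nproc                                     # logical CPUs
lscpu | grep -E 'CPU\(s\)|Core\(s\)|Socket'
```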
Can you explain why there is no difference in transfer speed between the jobs option set to 32 and set to 64?
Or does DVC not support parallel transfer of large data?