Hi!
I’ve started using DVC recently and I’m experiencing slow upload speeds to my remote storage.
I suspect this is related to the large number of data chunks in the cache directory that need to be uploaded.
As far as I’ve seen, the size of those chunks is 1MB.
Is there a way to change that value so I have fewer, but bigger, blocks of data?
Many thanks in advance for your help!
This is usually determined by the underlying library used to access the remote storage API and is not configurable in DVC. Could you please run `dvc doctor` and post the output here?
Ah, I see! Thank you for your reply!
Sure, here you go:
```
-> dvc doctor
DVC version: 2.51.0 (pip)
-------------------------
Platform: Python 3.9.16 on Linux-5.4.0-135-generic-x86_64-with-glibc2.27
Subprojects:
        dvc_data = 0.44.1
        dvc_objects = 0.21.1
        dvc_render = 0.3.1
        dvc_task = 0.2.0
        scmrepo = 0.1.16
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        ssh (sshfs = 2023.1.0),
        webdav (webdav4 = 0.9.8),
        webdavs (webdav4 = 0.9.8)
Cache directory: ext4 on /dev/mapper/ubuntu--vg-root
Caches: local
Remotes: webdavs
Workspace directory: ext4 on /dev/mapper/ubuntu--vg-root
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/cd00bdc44282bd9bb00d035a4a91b52f
->
```
It looks like you’re using a WebDAV remote, and in that case DVC doesn’t support setting the chunk size for file transfers. Feel free to open a feature request for this in our GitHub repo.
I think the block size is 2 MB, the same as for all fsspec filesystems. But I suspect the problem is not the chunking but the number of concurrent transfers, which defaults to 4 * the number of logical CPUs.
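If you want to see what that default works out to on your machine, something like this should do (just a quick sketch; `nproc` reports the logical CPU count on Linux, and the factor of 4 is the default mentioned above):

```
# assumption: default concurrency = 4 * logical CPUs, as described above
echo "default concurrent transfers: $((4 * $(nproc)))"
```

If that comes out much higher than what your WebDAV server handles comfortably, capping `jobs` as shown below is the first thing to try.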
You can try setting `jobs` in the config for the remote, or passing the `--jobs <n>` flag to `dvc push`:
```
dvc remote modify <remote_name> jobs 4
```

See the `remote modify` docs.
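For reference, after running that command the remote section of `.dvc/config` should look roughly like this (the remote name and URL below are placeholders, not taken from your setup):

```
$ cat .dvc/config
['remote "myremote"']
    url = webdavs://example.com/dav/dvcstore
    jobs = 4
```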
or:

```
dvc push --jobs 4
```

See the `push --jobs` docs.
Thank you for your help, I’ll look into that!