Hi,
I am facing the problem that dvc silently uploaded corrupted files (zero-byte files) to my Artifactory remote (HTTP-based remote). This happened not only to me but also to other colleagues. This issue mentions zero-byte files caused by VPN issues, but I think there might be a different underlying problem.
To fix the 500 zero-byte files created for an example.dvc file, I had to iterate 4 times over the following steps:
- 1) dvc push example.dvc
- 2) Cross-check that the size stated in example.dvc matches the total size of the artifacts uploaded to Artifactory.
- 3) Since step 2) showed a discrepancy between the two sizes, search for and delete all zero-byte artifacts (referenced by example.dvc) in Artifactory, including the .dir file.
Looping over steps 1-3) 4 times brought me down from 500 corrupted files, to 170, to 50, to 0 corrupted files. I find this behavior very disturbing because it makes DVC look unreliable, even though I really like the tool and the idea.
Unfortunately, reproducing this issue is hard because there seems to be some randomness involved.
My (amateur) guesses are incorrect handling of HTTP connection disruptions, or an unhandled connection timeout.
I verified that all those zero-byte files are non-zero in the local cache (just to make sure that the original uploaded file was not empty in the first place).
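In case it helps anyone reproduce the cross-check, below is a rough sketch of how the comparison between the local cache and the remote could be automated. It is only an illustration under several assumptions: that the cache and the remote both use the DVC 2.x layout (first two characters of the md5 as a directory, the rest as the file name), that the .dir file is a JSON list of entries with "md5" and "relpath" keys, and that the remote answers HEAD requests with a Content-Length header. All function names are made up for this example, and authentication for Artifactory is left out.

```python
import json
import os
import urllib.error
import urllib.request
from typing import Callable, Iterable, List, Optional, Tuple


def object_relpath(md5: str) -> str:
    """Assumed DVC 2.x object layout: '<first 2 chars>/<remaining chars>'."""
    return f"{md5[:2]}/{md5[2:]}"


def load_dir_md5s(dir_file: str) -> List[str]:
    """Read a .dir manifest, assumed to be a JSON list of {"md5", "relpath"}."""
    with open(dir_file, encoding="utf-8") as fh:
        return [entry["md5"] for entry in json.load(fh)]


def local_size(cache_dir: str, md5: str) -> Optional[int]:
    """Size of the object in the local cache, or None if it is missing."""
    path = os.path.join(cache_dir, *object_relpath(md5).split("/"))
    return os.path.getsize(path) if os.path.exists(path) else None


def remote_size(remote_url: str, md5: str) -> Optional[int]:
    """Size reported by a HEAD request, or None if missing or not reported."""
    url = f"{remote_url.rstrip('/')}/{object_relpath(md5)}"
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            length = resp.headers.get("Content-Length")
            return int(length) if length is not None else None
    except urllib.error.HTTPError:
        return None


def find_mismatches(
    md5s: Iterable[str],
    get_local: Callable[[str], Optional[int]],
    get_remote: Callable[[str], Optional[int]],
) -> List[Tuple[str, Optional[int], Optional[int]]]:
    """Return (md5, local size, remote size) for every object whose sizes differ."""
    bad = []
    for md5 in md5s:
        expected, actual = get_local(md5), get_remote(md5)
        if expected != actual:
            bad.append((md5, expected, actual))
    return bad
```

With a hypothetical cache path and remote URL, this could be called as `find_mismatches(load_dir_md5s(dir_file), lambda m: local_size(cache_dir, m), lambda m: remote_size(remote_url, m))`. Any entry with a remote size of 0 and a non-zero local size would correspond to the corrupted uploads described above.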
Here is my dvc doctor output:
DVC version: 2.15.0 (pip)
---------------------------------
Platform: Python 3.8.10 on Linux-5.13.0-37-generic-x86_64-with-glibc2.29
Supports:
hdfs (fsspec = 2022.5.0, pyarrow = 7.0.0),
webhdfs (fsspec = 2022.5.0),
http (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6),
https (aiohttp = 3.7.4.post0, aiohttp-retry = 2.4.6)
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/sda3
Caches: local
Remotes: https
Workspace directory: ext4 on /dev/sda3
Repo: dvc, git
Thank you very much for any input or debugging help. In general, I am really amazed by DVC and by the work and effort you put into this great tool!
Best,
Jo