I am a new user, working in Linux, and I am creating a DVC remote repo on NFS. I have a dataset split across four sub-directories, each with 10,000 images; the total size of all four sub-directories is 62 GB.
If I do a straight copy (not dvc) of the dataset to the repo location, the copy takes 50 min.
If I do ‘dvc add’ to the remote repo, the command is still running after 3.5 days. I assume that this is not expected? What would be the expected time for dvc add to a remote repo versus a straight copy to the same location?
Hello @peterbix,
we do not have any estimation mechanism that would let us give accurate expectations for different setups. We do realize that DVC works slowly with NFS. We have an open issue for that: https://github.com/iterative/dvc/issues/5562
I would recommend giving it a thumbs up and briefly describing your problem there, as it helps us prioritize issues that affect more users.
As mentioned, the dataset would have tens of thousands of images, and the size would be tens of GBs.
Step (1) would be to set up a DVC remote repository.
For some users, the use case would be to check out the repo to local disk and work with it, but not modify it.
For other users, the use case would be to modify the dataset, most likely by adding images locally, and then add them to the repo. The addition of images could happen frequently, such as every day.
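To make that concrete, the setup we have in mind looks roughly like the following; the NFS path, remote name, and dataset directory are placeholders rather than our real configuration:

# One-time setup: initialize DVC and point the default remote at the NFS mount
git init
dvc init
dvc remote add -d nfsremote /mnt/nfs/dvc-storage

# Use case 1: read-only users fetch the current dataset into their workspace
git pull
dvc pull

# Use case 2: users who add images update the dataset and push it back
dvc add data
git add data.dvc .gitignore
git commit -m "Add new images"
dvc push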
Hi @Paffciu - would you have any recommendations? We are precluded from using a DVC remote in the cloud - it has to be local (inside a secure network). Thanks for any thoughts.
Seems like you are doing it by the book…
I think that NFS might be a reason for that, but we can check one more thing.
We could try to profile a smaller push to confirm that.
You would need to have yappi installed (pip install yappi)
and then run dvc push --yappi.
If you could do that and want to do it, please remember to create a new remote folder, so that we do not leave trash in your main one.
If you are on UNIX,
you can generate a dataset with:
mkdir -p data
for i in {1..10}; do
    # 10 files of 10 MB each (~100 MB total); GNU dd on Linux uses "1M", BSD dd uses "1m"
    dd if=/dev/urandom of=data/$i bs=1M count=10
done
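Putting it together, the profiled run could look something like this; the scratch remote name and path are only examples:

pip install yappi

# Throwaway remote so nothing is left in the main storage folder
mkdir -p /mnt/nfs/dvc-scratch
dvc remote add scratch /mnt/nfs/dvc-scratch

# Add the generated data and profile the push with yappi
dvc add data
dvc push -r scratch --yappi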
@Paffciu, sorry that the experiments are not exactly in the form you suggested (we can still do that if it’s helpful). But we did the experiments below. The conclusions are that:
the dvc add time for making an NFS repo goes up exponentially (or at least faster than linearly) with dataset size
the dvc add time for making an NFS repo goes up exponentially with the number of files
Do these conclusions make sense? i.e. is this the expected behaviour, or is there something unusual on our side?
Clearly the time is all in the dvc add, not the dvc push. Out of interest in understanding how DVC works: why is that, and why does NFS behave differently from other types of repo?
————-
1 GB dataset with 131 files:
Local repo: dvc add and dvc push together complete in seconds
NFS repo: dvc add takes 3 minutes; dvc push takes 13 seconds
—
3 GB dataset with 377 files:
Local repo: dvc add and dvc push together complete in tens of seconds
NFS repo: dvc add takes 3 hours 10 minutes; dvc push takes 52 seconds
—
1 GB dataset with 1729 files (a mix of large and small files):
Local repo: dvc add and dvc push together complete in tens of seconds
NFS repo: dvc add takes 17 hours 9 minutes; dvc push takes 19 seconds
————-
We did not try 3GB with a mix of large and small files because the processing time would have been long.
@peterbix That would not make sense. We expect NFS to be slow because of the number of filesystem calls, but we do not expect transfer bandwidth to be throttled. Would it be possible to run one more test: 1 GB, but with the number of files set to 377?
In your experiments the number of files also increases, which to my knowledge should be the reason.
EDIT: ahh sorry, you are right that the number of files seems to be the culprit.
Regretfully, it seems that this is the NFS problem. It might get improved in the future once the issue linked above is addressed,
but I am unable to offer a reasonable alternative for this setup right now.
@Paffciu
Thanks for all your help. Just as a suggestion, you could flag in the DVC intro materials that there is a problem when using NFS with datasets over 1 GB. I’m not sure whether NFS repos are an unusual choice for your user base, but that could save some time for those new to DVC who are constrained in where they can put the repo.
BTW, we are now working internally on a different product that assumes data in storage is immutable and can handle collection manipulation much faster.
Would you be interested in checking the early docs to see if the workflow might fit your cases?
@volkfox
Hello, I’ve encountered a similar problem to the one above (dvc add and checkout to/from NFS repos), so would you mind sharing those docs with me as well?
It would be very much appreciated.