Time to create a DVC repo for a 60 GB dataset?

Hello

I am a new user, working on Linux, and I am creating a DVC remote repo on NFS. I have a dataset split across four sub-directories, each with 10,000 images; the total size of all four sub-directories is 62 GB.

If I do a straight copy (not dvc) of the dataset to the repo location, the copy takes 50 min.

If I do ‘dvc add’ for the dataset going to the remote repo, the command is still running after 3.5 days. I assume this is not expected? What would the expected time be for dvc add to a remote repo, versus a straight copy to the same location?
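
Roughly, the comparison being made is the following (the directory name and NFS mount point are illustrative placeholders, not the real paths):

time cp -r dataset/ /mnt/nfs/dvc-storage/    # straight copy: ~50 min for 62 GB
time dvc add dataset                         # tracking the same data with DVC: still running after 3.5 days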

Thanks, Paul

Hello @peterbix,
we do not have an estimation mechanism that would let us give accurate expectations for different setups. We do realize that DVC works slowly on NFS. We have an open issue for that:
https://github.com/iterative/dvc/issues/5562
I would recommend giving it a thumbs up and briefly describing your problem, as that pushes us to prioritize issues that affect more users.

Okay, thanks. FYI, for us DVC is then not an option, since tens of GB and bigger is a typical dataset size. Thanks again for your help with it.

@peterbix
This is understandable.
Could you share some context about your use case? Maybe we can think about some other approach that would be faster.

Sure, here is context -

  1. As mentioned, the dataset would have tens of thousands of images, and its size would be tens of GB.
  2. Step (1) would be to set up a DVC remote repository.
  3. For some users, the use case would be to check out the repo to local disk and work with it without modifying it.
  4. For other users, the use case would be to modify the dataset, most likely by adding images locally and then adding them to the repo. Images could be added frequently, such as every day (see the sketch after this list).
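
A rough sketch of both workflows, assuming a placeholder remote name and paths (not our actual setup):

# one-time setup of the shared NFS remote (path is a placeholder)
dvc remote add -d nfsremote /mnt/nfs/dvc-storage

# read-only users: fetch the dataset to local disk and work with it
dvc pull dataset.dvc

# contributing users: add new images locally, then publish them (e.g. daily)
cp /incoming/images/* dataset/
dvc add dataset
dvc push
git add dataset.dvc .gitignore
git commit -m "Add new images"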

Hi @Paffciu - would you have any recommendations? We are precluded from using a DVC remote in the cloud; it has to be local (inside a secure network). Thanks for any thoughts.

@peterbix
Sorry for the late response, one more question:
How do you add the data to the remote?
Is it:

  1. dvc add {data} && dvc push
  2. dvc add data --to-remote
  3. or did you specify a cache directory on NFS? (The three options are sketched below.)
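
Roughly, with placeholder names and paths:

# 1. hash into the local cache, then transfer to the remote
dvc add data && dvc push

# 2. transfer straight to the remote, without keeping a copy in the local cache
dvc add data --to-remote -r myremote

# 3. point the DVC cache itself at NFS, so 'dvc add' writes there directly
dvc cache dir /mnt/nfs/dvc-cache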

@Paffciu No worries, your help is very much appreciated. I used method (1) -
dvc remote add myremote
dvc add
dvc push

Seems like you are doing it by the book…
I think that NFS might be the reason, but we can check one more thing.

We could try to profile a smaller push to confirm that.
You would need to have yappi installed (pip install yappi).
Then run dvc push --yappi.

If you could and want to do that, please remember to create a new remote folder, so that we do not leave trash in your main one.
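
For example, something along these lines (the scratch remote name and path are placeholders):

pip install yappi                                  # profiler that the --yappi flag relies on

dvc remote add nfstest /mnt/nfs/dvc-profile-test   # throwaway remote so the main one stays clean
dvc push -r nfstest --yappi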

If you are on UNIX, you can generate a dataset with:

mkdir -p data                   # make sure the target directory exists
for i in {1..10}; do
	# 10 MB of random data per file (use bs=1M on Linux; bs=1m is the BSD/macOS spelling)
	dd if=/dev/urandom of=data/$i bs=1M count=10
done

It will result in a dataset of 10 files, 10 MB each.

@Paffciu, sorry that the experiments are not in exactly the form you suggested (we can still do that if it’s helpful). But we did run the experiments below. Our conclusions are that -

  • the dvc-add time for making an NFS repo grows much faster than linearly with dataset size
  • the dvc-add time for making an NFS repo grows much faster than linearly with the number of files

Do these conclusions make sense? i.e. is this the expected situation, or is there something unusual on our side?

Clearly the time is all in the dvc-add, not the dvc-push. Out of interest, to understand how DVC works: why is that, and why does NFS behave differently from other types of repo?

————-
1 GB dataset, 131 files:
Local repo: dvc-add + dvc-push take a few seconds in total
NFS repo: dvc-add takes 3 minutes; dvc-push takes 13 seconds

3 GB dataset, 377 files:
Local repo: dvc-add + dvc-push take tens of seconds in total
NFS repo: dvc-add takes 3 h 10 min; dvc-push takes 52 seconds

1 GB dataset, 1729 files (a mix of large and small files):
Local repo: dvc-add + dvc-push take tens of seconds in total
NFS repo: dvc-add takes 17 h 9 min; dvc-push takes 19 seconds
————-

We did not try 3 GB with a mix of large and small files because the processing time would have been too long.

@peterbix
That would not be expected. We expect NFS to be slow due to the number of filesystem calls, but we do not expect the transfer bandwidth itself to be throttled. Would it be possible to run one more test: 1 GB, but with 377 files?
In your experiments the number of files increases along with the size, and to my knowledge the file count should be the real reason.
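
The earlier generation loop could be adapted to produce that test set, something like (sizes are approximate):

mkdir -p data-377
for i in {1..377}; do
	dd if=/dev/urandom of=data-377/$i bs=1M count=3 status=none   # 377 files of 3 MB, ~1.1 GB total
done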

EDIT: ahh sorry, you are right that number of files seems to be the culprit.

Regretfully, it seems that this is the NFS problem. It might improve in the future once https://github.com/iterative/dvc/issues/5562 is addressed, but I am unable to suggest a reasonable alternative for this setup right now.

@Paffciu
Thanks for all your help. Just as a suggestion, you could flag in the DVC intro materials that there is a problem when using NFS with datasets over 1 GB. I am not sure whether NFS repos are an unusual choice for your user base, but that could save some time for people new to DVC who are constrained in where the repo can live.


@peterbix

BTW, we are now working internally on a different product that assumes data in storage is immutable and can manipulate collections much faster.

Would you be interested in checking the early docs to see if the workflow might fit your cases?

Yes please, very interested.

Sent some docs to your registered email at two**.com

@volkfox
Hello, I’ve encountered a similar problem to the one above (dvc add and checkout to/from NFS repos), so would you mind sharing those docs with me as well?
That would be very much appreciated :slight_smile: