Timing to create a dvc repo for a 60GB dataset?

peterbix · November 1, 2021, 3:59am

Hello

I am a new user, working in Linux, and I am creating a DVC remote repo on NFS. I have a dataset split across four sub-directories, each one with 10000 images, total size of all four sub-directories is 62GB.

If I do a straight copy (not dvc) of the dataset to the repo location, the copy takes 50 min.

If I do ‘dvc add’ to the remote repo, the command is still running after 3.5 days. I assume that this is not expected? What would be expected time for dvc add to a remote repo, versus a straight copy to the same location?

Thanks, Paul

Paffciu · November 1, 2021, 10:44am

Hello @peterbix,
We do not have any mechanism for estimating how long this should take on a given setup. We do know that DVC is slow on NFS, and we have an issue open for that:
https://github.com/iterative/dvc/issues/5562
I would recommend giving it a thumbs up and briefly describing your problem, as that pushes us to prioritize issues that affect more users.

peterbix · November 1, 2021, 10:59am

Okay, thanks. FYI, for us DVC is then not an option, since our typical dataset size is tens of GB or more. Thanks again for your help with it.

Paffciu · November 1, 2021, 11:07am

@peterbix
This is understandable.
Could you share some context about your use case? Maybe we can think about some other approach that would be faster.

peterbix · November 1, 2021, 11:18am

Sure, here is context -

As mentioned, the dataset would have tens of thousands of images, and size would be tens of GBs.
Step (1) would be to set up a dvc remote repository.
For some users, the Use Case would be to checkout the repo to local disk, and work with it but not modify it.
For other users, the Use Case would be to modify the dataset, most likely by adding images locally, and then add to the repo. The addition of images could be happening frequently such as everyday.

peterbix · November 2, 2021, 2:09pm

Hi @Paffciu - would you have any recommendations? We are precluded from using a dvc remote in the cloud - has to be local (inside a secure network). Thanks for any thoughts.

Paffciu · November 2, 2021, 4:42pm

@peterbix
Sorry for the late response, one more question:
How do you add the data to the remote?
Is it:

(1) dvc add {data} && dvc push
(2) dvc add data --to-remote
(3) or did you specify a cache on NFS?

peterbix · November 2, 2021, 4:58pm

@Paffciu No worries, I very much appreciate your help. I did method (1) -
dvc remote add myremote
dvc add
dvc push
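
For reference, method (1) with its arguments filled in might look like the sketch below. The function name and the example paths are hypothetical placeholders, not values from this thread; only the remote name myremote comes from the post above.

```shell
# Sketch of method (1): register a directory remote, track the data, push it.
# The function name and both arguments are hypothetical placeholders.
push_to_directory_remote() {
    remote_dir="$1"   # e.g. a directory on the NFS mount
    data_dir="$2"     # e.g. the dataset directory inside the Git + DVC project

    # -d makes "myremote" the default remote for later pushes and pulls
    dvc remote add -d myremote "$remote_dir" &&
    # hashes the files into the local cache and writes <data_dir>.dvc
    dvc add "$data_dir" &&
    # copies the cached files to the remote
    dvc push
}

# Usage, inside an initialized Git + DVC project:
# push_to_directory_remote /mnt/nfs/dvc-storage data
```

Chaining with && stops at the first failing step, so a failed dvc add never leads to a push.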

Paffciu · November 2, 2021, 5:32pm

Seems like you are doing it by the book…
I think that NFS might be a reason for that, but we can check one more thing.

We could try to profile a smaller push to confirm that.
You would need to have yappi installed (pip install yappi).
You would need to run dvc push --yappi

If you are able and willing to do that, please remember to create a new remote folder, so that we do not leave trash in your main one.

If you are on UNIX,
you can generate a dataset with:

mkdir -p data
for i in {1..10};
do
	# 1048576-byte blocks: GNU dd (Linux) rejects the lowercase "1m" suffix
	dd if=/dev/urandom of=data/$i bs=1048576 count=10
done

It will result in a dataset of 10 files, 10 MB each.

peterbix · November 3, 2021, 10:01am

@Paffciu, sorry that the experiments are not in the form just as you suggested (can still do that if it’s helpful). But we did the experiments below. Conclusions are that -

the dvc-add time for making an NFS repo goes up exponentially (or above linear) with dataset size
the dvc-add time for making an NFS repo goes up exponentially with the number of files

Do these conclusions make sense? i.e. is this the expected situation, or is there something unusual on our side?

Clearly the time is all in the dvc-add not the dvc-push. Out of interest in understanding how dvc works, why is that, and why is NFS behaving differently to other types of repo?

————-
1GB dataset with 131 files -

Local repo: total time for dvc-add & -push in seconds
NFS repo: dvc-add takes 3 minutes ; dvc-push takes 13 seconds

—
3GB dataset with 377 files -

Local repo: total time for dvc-add & -push in tens of seconds
NFS repo: dvc-add takes 3:10 hours ; dvc-push takes 52 seconds

—
1GB dataset with 1729 files, a mix of large and small files -

Local repo: total time for dvc-add & -push in tens of seconds
NFS repo: dvc-add takes 17:09 hours ; dvc-push takes 19 seconds
————-

We did not try 3GB with a mix of large and small files because the processing time would have been long.
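
To compare the three NFS runs on a common scale, the dvc-add times can be converted to seconds per file. A minimal sketch, assuming "3:10 hours" and "17:09 hours" mean hours:minutes; the helper name is hypothetical:

```shell
# Convert the reported NFS dvc-add times to seconds per file.
# Assumption: "3:10 hours" and "17:09 hours" are read as hours:minutes.
per_file_seconds() {
    # $1 = total seconds, $2 = number of files
    awk -v total="$1" -v files="$2" 'BEGIN { printf "%.1f", total / files }'
}

echo "1GB,  131 files: $(per_file_seconds 180 131) s/file"
echo "3GB,  377 files: $(per_file_seconds $((3*3600 + 10*60)) 377) s/file"
echo "1GB, 1729 files: $(per_file_seconds $((17*3600 + 9*60)) 1729) s/file"
```

Under that reading, the cost works out to roughly 1.4, 30.2, and 35.7 seconds per file. The per-file cost is not constant, which matches the above-linear growth described, though three runs are too few to separate the effect of file count from that of dataset size.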

Paffciu · November 3, 2021, 11:54am

@peterbix
That would not make sense. We expect NFS to be slow due to the number of filesystem calls, but we do not expect transfer bandwidth to be throttled. Would it be possible to run one more test, with 1 GB but 377 files?
In your experiments, the number of files also increases, which to my knowledge should be the reason.

EDIT: ahh sorry, you are right that number of files seems to be the culprit.

Regretfully, it seems that this is the NFS problem. It might improve in the future once this issue is addressed:

github.com/iterative/dvc
dvc: QA nfs/cifs/overlay/etc
(opened 06:40PM - 07 Mar 21 UTC by efiop; labels: enhancement, p2-medium, performance)

When working with local repos, dvc uses a lot of `exists()` and `stat()` calls, which is fine with normal filesystems, but could be extremely slow with filesystems like nfs/cifs/etc (e.g. `stat()` can be 100x slower). We should look into conserving such calls, similar to what we do with remotes, where we know that an api call to s3/gdrive/etc is pretty slow, so we try to do as few calls as possible.
* https://discord.com/channels/485586884165107732/728693131557732403/818168922393673738

Another issue with nfs/cifs/overlay is problems with the sqlite database (e.g. freezing locks) that we use for local optimization https://github.com/iterative/dvc/issues/4420 . Dvc itself is also often affected by it, which is why we provide an option to use `flufl.lock` as repo lock instead of flock-based lock. For sqlite it is not an option, so a solution might be to place it somewhere on a normal filesystem (e.g. /tmp is usually fine).

I am unable to provide a reasonable alternative for this setup right now.
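
The `stat()` overhead described in the quoted issue can be checked directly by timing metadata lookups over the same files on local disk and on the NFS mount. A rough sketch, assuming GNU find, stat, and date (typical on Linux); the function name and example paths are hypothetical, and this is not how DVC itself measures anything:

```shell
# Time how long it takes to stat() every regular file under a directory.
# Assumes GNU find, stat, and date; the function name is hypothetical.
time_stat_calls() {
    dir="$1"
    start=$(date +%s.%N)
    # "-exec ... +" passes many paths to each stat process, so the elapsed
    # time is mostly filesystem metadata lookups, not process startup.
    find "$dir" -type f -exec stat -c %s {} + > /dev/null
    end=$(date +%s.%N)
    # Count after timing so the count does not warm the cache beforehand.
    count=$(find "$dir" -type f | wc -l)
    awk -v n="$count" -v s="$start" -v e="$end" \
        'BEGIN { printf "%d files, %.3f s total\n", n, e - s }'
}

# Hypothetical paths: the same dataset on local disk and on the NFS mount.
# time_stat_calls /tmp/dataset
# time_stat_calls /mnt/nfs/dataset
```

Repeated runs on NFS may be served from the client attribute cache, so the first run on a freshly mounted share is the most representative.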

peterbix · November 3, 2021, 12:05pm

@Paffciu
Thanks for all your help. Just as a suggestion, you could flag it up in the DVC intro materials that there is a problem when using NFS and datasets over 1GB. Not sure if NFS repos are an unusual choice for your user base, but that could save some time for those new to DVC who are constrained in the location for the repo.

volkfox · December 11, 2021, 3:15am

@peterbix

BTW we are now working internally on a different product that assumes data in storage is immutable, and can operate much faster with collection manipulation.

Would you be interested in checking the early docs to see if the workflow might fit your use cases?

peterbix · December 11, 2021, 8:35am

Yes please, very interested.

volkfox · December 14, 2021, 12:41am

Sent some docs to your registered email at two**.com

schakal · March 21, 2022, 5:38am

@volkfox
Hello, I’ve encountered a similar problem to the one above (dvc add and checkout to/from NFS repos), so would you mind sharing those docs with me as well?
I would really appreciate it.
