Shared cache on NFS is slow

I have multiple users who can each launch their own AWS instance for data analysis and then shut it down when they are done. Rather than rebuilding the DVC cache on each of these instances every time one starts up, we want a persistent shared cache.

So I’ve created a shared DVC cache on EFS (AWS’s managed NFS) with symlinks in the workspace, but the problem is that `dvc fetch` and `dvc checkout` are now quite slow for certain branches that have thousands of files tracked by DVC. First, for a `dvc fetch`, even if nothing needs to be downloaded from the remote, just “Querying cache…” can take up to 8 minutes. Then `dvc checkout` can take 6 more minutes. Presumably this comes down to NFS communication overhead when listing files in the cache (the progress bar reports rates like “110 files/sec”).
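
For context, the cache is configured roughly like this on each instance (the EFS mount path below is just a placeholder):

```
# Point DVC at the shared cache directory on the EFS mount (path is an example)
dvc cache dir /mnt/efs/dvc-cache

# Link files from the cache into the workspace as symlinks instead of copies
dvc config cache.type symlink

# Let multiple users share the cache directory
dvc config cache.shared group
```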

Is there any way to speed up a shared cache stored on NFS, especially with thousands of tracked files?

Could you please get a cProfile dump of your command? You can capture it by adding `--cprofile --cprofile-dump my.prof` to the command.
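
For example (the `.prof` file names here are just examples):

```
dvc fetch --cprofile --cprofile-dump fetch.prof
dvc checkout --cprofile --cprofile-dump checkout.prof
```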

Could any private information end up in these dump files (passwords, AWS keys, file or variable contents, etc.)? I am a little hesitant to post the full files because of that.

That said, I analyzed the files with snakeviz and can tell you that for `dvc fetch` the vast majority of the time was spent in `hashes_exist > ... > ~:0(<built-in method posix.stat>)`.

Same for `dvc checkout`: it all comes down to `posix.stat`.
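
In case it’s useful, this is roughly how I inspected the dumps (file name is just an example):

```
pip install snakeviz
snakeviz fetch.prof   # opens an interactive call-tree view in the browser
```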

Guess it is related to this one? https://github.com/iterative/dvc/issues/5562

Yes, it looks like it. I’ll look forward to any improvements that can be made on the DVC side.

For those using Amazon EFS, I recommend the General Purpose performance mode (rather than Max I/O); it has lower latency and so appears to speed up the `stat()` calls.
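
If you want to check which mode an existing filesystem uses, something like this with the AWS CLI should work (a sketch; as far as I know the performance mode is fixed when the filesystem is created):

```
# List EFS filesystems with their performance mode (generalPurpose or maxIO)
aws efs describe-file-systems \
    --query 'FileSystems[].{Id:FileSystemId,Mode:PerformanceMode}'
```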