Total data usage and use of external shared cache

Hi all,
I’m fairly new to DVC and I’m trying to understand certain concepts in order to set up a data versioning system.

1.) The documentation gives an explanation about the cache and how to optimize it for large datasets. (Large Dataset Optimization).
However, in terms of pure data size it’s still unclear to me how the total data size increases when using DVC.
Let’s say my total data usage is 100GB before using DVC. After setting up DVC with a cache that uses the copy link type (default), my total data usage would be 300GB (100GB local + 100GB cache + 100GB remote storage).
(See: https://littlebigcode.fr/wp-content/uploads/2022/05/image-20220227-182610.png)
So in my understanding, the best I can do in terms of total storage size is to use a reflink/hardlink/symlink cache type so that my local copy of the data shrinks to a fraction of the total data size. Still, the cache AND the data at the remote location add up to 200GB. Is it correct that even the lighter link types would still result in at least 2x the total storage compared to not using DVC?

2.) My second concern is about the shared cache.
Suppose I have a mounted drive with data that is accessed by multiple people working on the same project. Access to the data is cheap and infrequent, but the data is large and should not be stored permanently in the local workspace. Now I want to set up DVC for data versioning. Is a shared external cache suitable for this scenario, e.g. on the same mounted drive as the data? Also, if the data were on S3 instead of the mounted drive, would it be useful to set up the external shared cache on S3 as well?
Are there any additional advantages of the shared cache besides data de-duplication if data access is already fast?
Maybe someone can share insights about when (not) to use a shared external cache.

Thanks in advance

Yes, compared to using no backup or versioning at all, DVC with remote storage will take up ~2x storage across all systems, since the remote keeps a second copy of the data. If you think of the remote storage as a backup, then it’s roughly equivalent to the storage you would need for doing backups yourself. If you do not need a backup, you may use the cache as your only storage and keep roughly the same storage footprint that you would have without DVC. Once you start modifying the data and want to keep the older versions around, DVC starts to save storage space compared to keeping full copies of each version.
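
To make the arithmetic concrete, here is a rough sketch in Python. It is only a back-of-the-envelope model, not anything DVC computes itself: the function names are made up for illustration, the 100GB figure comes from the question, and the 10GB of changed files per version is an assumption.

```python
# Back-of-the-envelope storage model; all names here are hypothetical.
# Assumption: a 100 GB dataset, no compression, link types cost ~0 GB.

def total_storage_gb(dataset_gb, workspace_is_link, use_remote):
    """Approximate total storage across workspace, cache, and remote."""
    cache = dataset_gb                                  # the cache holds one full copy
    workspace = 0 if workspace_is_link else dataset_gb  # a link adds almost nothing
    remote = dataset_gb if use_remote else 0            # the remote is a second full copy
    return workspace + cache + remote


def versioned_cache_gb(dataset_gb, changed_gb_per_version, n_versions):
    """Cache size when each new version only adds the files that changed.

    Assumes every version after the first changes a distinct set of files.
    """
    return dataset_gb + changed_gb_per_version * (n_versions - 1)


scenarios = {
    "copy + remote": total_storage_gb(100, workspace_is_link=False, use_remote=True),
    "link + remote": total_storage_gb(100, workspace_is_link=True, use_remote=True),
    "link, cache only": total_storage_gb(100, workspace_is_link=True, use_remote=False),
}
for name, gb in scenarios.items():
    print(f"{name}: {gb} GB")

# Five versions, 10 GB of changed files each (assumed), versus five full copies.
dvc_cache = versioned_cache_gb(100, changed_gb_per_version=10, n_versions=5)
full_copies = 100 * 5
print(f"5 versions in cache: {dvc_cache} GB vs. {full_copies} GB as full copies")
```

The last line is where the savings come from: files that do not change between versions are stored in the cache only once.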

I think you have the right idea, and setting up a shared cache on a mounted drive makes sense. If you want to use S3 for that cache, I would suggest you mount it to your filesystem using something like s3fs-fuse (a FUSE-based file system backed by Amazon S3).
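
As a small illustration of why a cache on the mounted drive keeps the local workspace small, the Python sketch below mimics the idea with plain symlinks in a temporary directory. It does not call DVC at all; the directory names `shared_cache` and `workspace` are hypothetical stand-ins for the mounted drive and a local checkout.

```python
import os
import tempfile


def stored_bytes(root):
    """Sum the sizes of regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "shared_cache")   # stand-in for the mounted drive
    workspace = os.path.join(tmp, "workspace")  # stand-in for a local checkout
    os.makedirs(cache)
    os.makedirs(workspace)

    # The real data lives only in the cache.
    cached_file = os.path.join(cache, "data.bin")
    with open(cached_file, "wb") as f:
        f.write(b"\0" * 1_000_000)

    # The workspace only holds a symlink pointing into the cache.
    link = os.path.join(workspace, "data.bin")
    os.symlink(cached_file, link)

    cache_bytes = stored_bytes(cache)
    workspace_bytes = stored_bytes(workspace)
    with open(link, "rb") as f:
        readable_bytes = len(f.read())

print(f"cache: {cache_bytes} B, workspace: {workspace_bytes} B, "
      f"readable via link: {readable_bytes} B")
```

In DVC itself, the corresponding setup would be along the lines of `dvc cache dir <path-on-mounted-drive>`, `dvc config cache.type symlink` and `dvc config cache.shared group`. Hardlinks and reflinks cannot span filesystems, so symlink is the link type to look at when the cache sits on a different mount than the workspace.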
